Abstract

We address phylogenetic reconstruction when the data is generated from a mixture 
distribution. Such topics have gained considerable attention in the biological community 
with the clear evidence of heterogeneity of mutation rates. In our work we consider data 
coming from a mixture of trees which share a common topology but differ in their edge 
weights (i.e., branch lengths). We first show the pitfalls of popular methods, including 
maximum likelihood and Markov chain Monte Carlo algorithms. We then determine
for which evolutionary models reconstructing the tree topology under a mixture
distribution is (im)possible. We prove that every model whose transition matrices can be
parameterized by an open set of multi-linear polynomials either has non-identifiable
mixture distributions, in which case reconstruction is impossible in general, or admits
linear tests which identify the topology. This duality theorem relies on our notion
of linear tests and uses ideas from convex programming duality. Linear tests are closely
related to linear invariants, which were first introduced by Lake, and are natural from 
an algebraic geometry perspective. 

1 Introduction 

A major obstacle to phylogenetic inference is the heterogeneity of genomic data. For exam- 
ple, mutation rates vary widely between genes, resulting in different branch lengths in the 
phylogenetic tree for each gene. In many cases, even the topology of the tree differs between 
genes. Within a single long gene, we are also likely to see variations in the mutation rate, 
see |1U] for a current study on regional mutation rate variation. 
Our focus is on phylogenetic inference based on single nucleotide substitutions. In this
paper we study the effect of mutation rate variation on phylogenetic inference. The exact 
mechanisms of single nucleotide substitutions are still being studied, hence the causes of 
variations in the rate of these mutations are unresolved, see [§1 1241 IT7| . In this paper we 
study phylogenetic inference in the presence of heterogeneous data. 

For homogeneous data, i.e., data generated from a single phylogenetic tree, there is considerable
work on consistency of various methods, such as likelihood [1] and distance methods,
and inconsistency of other methods, such as parsimony [7]. Consistency means that the 
methods converge to the correct tree with sufficiently large amounts of data. We refer the 
interested reader to Felsenstein (Hj for an introduction to these phylogenetic approaches. 

There are several works showing the pitfalls of popular phylogenetic methods when data 
is generated from a mixture of trees, as opposed to a single tree. We review these works 
in detail shortly. The effect of mixture distributions has been of marked interest recently 
in the biological community, for instance, see the recent publications of Kolaczkowski and
Thornton [13] and Mossel and Vigoda [18].

In our setting the data is generated from a mixture of trees which have a common tree 
topology, but can vary arbitrarily in their mutation rates. We address whether it is possible 
to infer the tree topology. We introduce the notion of a linear test. For any mutational
model whose transition probabilities can be parameterized by an open set (see the following 
subsection for a precise definition), we prove that the topology can be reconstructed by 
linear tests, or it is impossible in general due to a non-identifiable mixture distribution. 
For several of the popular mutational models we determine which of the two scenarios
(reconstruction or non-identifiability) holds.

The notion of a linear test is closely related to the notion of linear invariants. In fact, 
Lake's invariants are a linear test. There are simple examples where linear tests exist and 
linear invariants do not (in these examples, the mutation rates are restricted to some range). 
However, for the popular mutation models, such as Jukes-Cantor and Kimura's 2 parameter 
model (both of which are closed under multiplication) we have no such examples. For the 
Jukes-Cantor and Kimura's 2 parameter model, we prove the linear tests are essentially 
unique (up to certain symmetries). In contrast to the study of invariants, which is natural 
from an algebraic geometry perspective, our work is based on convex programming duality. 

We present the background material before formally stating our new results. We then 
give a detailed comparison of our results with related previous work. 

An announcement of the main results of this paper, along with some applications of the
technical tools presented here, is presented in [23].

1.1 Background 

A phylogenetic tree is an unrooted tree T on n leaves (called taxa, corresponding to n 
species) where internal vertices have degree three. Let E(T) denote the edges of T and 
V(T) denote the vertices. The mutations along edges of T occur according to a continuous-time
Markov chain. Let $\Omega$ denote the states of the model. The case $|\Omega| = 4$ is biologically
important, whereas $|\Omega| = 2$ is mathematically convenient.

The model is defined by a phylogenetic tree T and a distribution $\pi$ on $\Omega$. Every edge
e has an associated $|\Omega| \times |\Omega|$ rate matrix $R_e$, which is reversible with respect to $\pi$, and
a time $t_e$. Note, since $R_e$ is reversible with respect to $\pi$, then $\pi$ is the stationary vector
for $R_e$ (i.e., $\pi R_e = 0$). The rate matrix defines a continuous-time Markov chain. Then,
$R_e$ and $t_e$ define a transition matrix $P_e = \exp(t_e R_e)$. The matrix is a stochastic matrix of
size $|\Omega| \times |\Omega|$ and thus defines a discrete-time Markov chain, which is time-reversible, with
stationary distribution $\pi$ (i.e., $\pi P_e = \pi$).
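As an illustration of the map from $(R_e, t_e)$ to $P_e$, the following sketch computes $\exp(t_e R_e)$ for a Jukes-Cantor style rate matrix (off-diagonal entries $\alpha$, diagonal entries $-3\alpha$) and compares it with the standard closed form. The helper names (`mat_exp`, `jc_rate_matrix`, `transition_matrix`) and the numeric values are assumptions made for this example only; the matrix exponential is a plain truncated Taylor series with scaling and squaring, not a production routine.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, squarings=10, terms=20):
    """exp(a) via a truncated Taylor series of a / 2^squarings, then repeated squaring."""
    n = len(a)
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jc_rate_matrix(alpha):
    """Hypothetical helper: JC rate matrix, reversible w.r.t. the uniform distribution."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def transition_matrix(rate, t):
    """P_e = exp(t_e R_e)."""
    return mat_exp([[t * v for v in row] for row in rate])

alpha, t = 0.3, 0.5          # illustrative values only
P = transition_matrix(jc_rate_matrix(alpha), t)
decay = math.exp(-4 * alpha * t)
closed_form_diag = 0.25 + 0.75 * decay
closed_form_off = 0.25 - 0.25 * decay
```

Each row of `P` should sum to 1, and because `P` is symmetric here, the uniform distribution is stationary for it.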

Given $P = (P_e)_{e \in E(T)}$ we then define the following distribution on labelings of the
vertices of T. We first orient the edges of T away from an arbitrarily chosen root r of the
tree. (We can choose the root arbitrarily since each $P_e$ is reversible with respect to $\pi$.)
Then, the probability of a labeling $\ell : V(T) \to \Omega$ is

$$\mu'_{T,P}(\ell) = \pi(\ell(r)) \prod_{\overrightarrow{uv} \in E(T)} P_{uv}(\ell(u), \ell(v)). \qquad (1)$$

Let $\mu_{T,P}$ be the marginal distribution of $\mu'_{T,P}$ on the labelings of the leaves of T ($\mu_{T,P}$
is a distribution on $\Omega^n$ where n is the number of leaves of T). The goal of phylogeny
reconstruction is to reconstruct T (and possibly P) from $\mu_{T,P}$ (more precisely, from independent
samples from $\mu_{T,P}$).
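The marginal $\mu_{T,P}$ can be computed by brute force on a 4-leaf tree by summing (1) over the labels of the two internal vertices. The sketch below does this for the binary CFN model; the function names, the 0-based leaf indices, and the encoding of a tree as a pair of leaf pairs are assumptions of this example rather than notation from the paper.

```python
from itertools import product

def cfn(p):
    """CFN transition matrix with mutation (flip) probability p."""
    return [[1 - p, p], [p, 1 - p]]

def leaf_distribution(split, leaf_mats, internal_mat, pi):
    """Leaf marginal of equation (1) on a 4-leaf tree.

    split = ((i, j), (k, l)): leaves i, j hang off internal vertex u (the root),
    leaves k, l hang off internal vertex v; leaves are numbered 0..3.
    """
    states = range(len(pi))
    side_u, side_v = split
    mu = {}
    for labels in product(states, repeat=4):
        total = 0.0
        for lu in states:              # label of the root u
            for lv in states:          # label of the other internal vertex v
                prob = pi[lu] * internal_mat[lu][lv]
                for i in side_u:
                    prob *= leaf_mats[i][lu][labels[i]]
                for i in side_v:
                    prob *= leaf_mats[i][lv][labels[i]]
                total += prob
        mu[labels] = total
    return mu

# Illustrative parameters only.
mu = leaf_distribution(((0, 1), (2, 3)),
                       [cfn(0.1), cfn(0.2), cfn(0.15), cfn(0.3)],
                       cfn(0.05), [0.5, 0.5])
```

The result is a dictionary over all $|\Omega|^4 = 16$ leaf labelings whose values sum to 1.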

The simplest four-state model has a single parameter $\alpha$ for the off-diagonal entries of
the rate matrix. This model is known as the Jukes-Cantor model, which we denote as JC.
Allowing 2 parameters in the rate matrix is Kimura's 2 parameter model which we denote as
K2, see Section 5.2 for a formal definition. The K2 model accounts for the higher mutation
rate of transitions (mutations within purines or pyrimidines) compared to transversions
(mutations between a purine and a pyrimidine). Kimura's 3 parameter model, which we
refer to as K3, accounts for the number of hydrogen bonds altered by the mutation. See
Section 7 for a formal definition of the K3 model. For $|\Omega| = 2$, the model is binary and the
rate matrix has a single parameter $\alpha$. This model is known as the CFN (Cavender-Farris-Neyman)
model. For any examples in this paper involving the CFN, JC, K2 or K3 models,
we restrict the model to rate matrices where all the entries are positive, and times $t_e$ which
are positive and finite.

We will use $\mathcal{M}$ to denote the set of transition matrices obtainable by the model under
consideration, i.e.,

$$\mathcal{M} = \{P_e = \exp(t_e R_e) : t_e \text{ and } R_e \text{ are allowed in the model}\}.$$

The above setup allows additional restrictions in the model, such as requiring $t_e > 0$, which
is commonly required.
In our framework, a model is specified by a set $\mathcal{M}$, and then each edge is allowed any
transition matrix $P_e \in \mathcal{M}$. We refer to this framework as the unrestricted framework, since
we are not imposing any dependencies on the choice of transition matrices between edges.
This set-up is convenient since it gives a natural algebraic framework for the model as we
will see in some later proofs. A similar set-up was required in the work of Allman and
Rhodes ^P, also to utilize the algebraic framework.

An alternative framework (which is typical in practical works) requires a common rate 
matrix for all edges, specifically $R = R_e$ for all e. Note we cannot impose such a restriction
in our unrestricted framework, since each edge is allowed any matrix in $\mathcal{M}$. We will refer
to this framework as the common rate framework. Note, for the Jukes-Cantor and CFN 
models, the unrestricted and common rate frameworks are identical, since there is only a 
single parameter for each edge in these models. We will discuss how our results apply to 
the common rate framework when relevant, but the default setting of our results is the 
unrestricted model. 

Returning to our setting of the unrestricted framework, recall under the condition $t_e > 0$
the set $\mathcal{M}$ is not a compact set (and is parameterized by an open set as described shortly).
This will be important for our work since our main result will only apply to models where
$\mathcal{M}$ is an open set. Moreover we will require that $\mathcal{M}$ consists of multi-linear polynomials.
More precisely, a polynomial $p \in \mathbb{R}[x_1, \dots, x_m]$ is multi-linear if for each variable $x_i$ the
degree of p in $x_i$ is at most 1. Our general results will apply when the model is a set
of multi-linear polynomials which are parameterized by an open set which we now define
precisely.

Definition 1. We say that a set $\mathcal{M}$ of transition matrices is parameterized by an open set
if there exists a finite set $\Omega$, a distribution $\pi$ over $\Omega$, an integer m, an open set $O \subseteq \mathbb{R}^m$,
and multi-linear polynomials $p_{ij} \in \mathbb{R}[x_1, \dots, x_m]$ such that

$$\mathcal{M} = \{(p_{ij})_{i,j=1}^{|\Omega|} \mid (x_1, \dots, x_m) \in O\},$$

where $\mathcal{M}$ is a set of stochastic matrices which are reversible with respect to $\pi$ (thus $\pi$ is
their stationary distribution).

Typically the polynomials $p_{ij}$ are defined by an appropriate change of variables from
the variables defining the rate matrices. Some examples of models that are parameterized
by an open set are the general Markov model considered by Allman and Rhodes pQ; Jukes-Cantor,
Kimura's 2-parameter and 3-parameter, and Tamura-Nei models. For the Tamura-Nei
model (which is a generalization of Jukes-Cantor and Kimura's models) we show in [23]
how the model can be re-parameterized in a straightforward manner so that it consists of
multi-linear polynomials, and thus fits the parameterized by an open set condition (assuming
the additional restriction $t_e > 0$).
1.2 Mixture Models



In our setting, we will generate assignments from a mixture distribution. We will have
a single tree topology T, a collection of k sets of transition matrices $P_1, P_2, \dots, P_k$ where
$P_i \in \mathcal{M}^{E(T)}$ and a set of non-negative reals $q_1, q_2, \dots, q_k$ where $\sum_i q_i = 1$. We then consider
the mixture distribution:

$$\mu = \sum_i q_i \, \mu_{T,P_i}.$$

Thus, with probability $q_i$ we generate a sample according to $\mu_{T,P_i}$. Note the tree topology
is the same for all the distributions in the mixture (thus there is a notion of a generating
topology). In several of our simple examples we will set k = 2 and $q_1 = q_2 = 1/2$, thus we will
be looking at a uniform mixture of two trees.
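The two readings of a mixture, as a weighted sum of distributions and as a two-stage sampling procedure, can be sketched as follows. The component distributions `mu_1` and `mu_2` are toy stand-ins for $\mu_{T,P_1}$ and $\mu_{T,P_2}$ over a two-leaf label space, and the function names are hypothetical.

```python
import random

def mixture(components, weights):
    """Return sum_i q_i * mu_i for distributions given as dicts over the same outcomes."""
    if any(q < 0 for q in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to 1")
    outcomes = components[0].keys()
    return {y: sum(q * mu[y] for q, mu in zip(weights, components))
            for y in outcomes}

def sample_from_mixture(components, weights, rng):
    """Pick component i with probability q_i, then draw a sample from it."""
    mu = rng.choices(components, weights=weights, k=1)[0]
    outcomes = list(mu)
    return rng.choices(outcomes, weights=[mu[y] for y in outcomes], k=1)[0]

# Toy stand-ins for two component distributions (not derived from any tree).
mu_1 = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
mu_2 = {"00": 0.1, "01": 0.4, "10": 0.4, "11": 0.1}

mu = mixture([mu_1, mu_2], [0.5, 0.5])       # uniform mixture, k = 2
sample = sample_from_mixture([mu_1, mu_2], [0.5, 0.5], random.Random(0))
```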

1.3 Maximum Likelihood and MCMC results 

We begin by showing a simple class of mixture distributions where popular phylogenetic 
algorithms fail. In particular, we consider maximum likelihood methods, and Markov chain 
Monte Carlo (MCMC) algorithms for sampling from the posterior distribution. 

In the following, for a mixture distribution $\mu$, we consider the likelihood of a tree T as
the maximum over assignments of transition matrices P to the edges of T, of the probability
that the tree (T, P) generated $\mu$. Thus, we are considering the likelihood of a pure
(non-mixture) distribution having generated the mixture distribution $\mu$. More formally, the
maximum expected log-likelihood of tree T for distribution $\mu$ is defined by

$$\mathcal{L}_T(\mu) = \max_{P \in \mathcal{M}^{E(T)}} \mathcal{L}_{T,P}(\mu),$$

where

$$\mathcal{L}_{T,P}(\mu) = \sum_{y \in \Omega^n} \mu(y) \ln(\mu_{T,P}(y)).$$

Recall for the CFN, JC, K2 and K3 models, $\mathcal{M}$ is restricted to transition matrices
obtainable from positive rate matrices $R_e$ and positive times $t_e$.
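A minimal sketch of the inner quantity $\mathcal{L}_{T,P}(\mu)$ for the CFN model on 4 leaves follows; the maximization over P is not implemented. The tree encoding (a pair of 0-based leaf pairs), the function names, and the parameter values are assumptions of this example. The data $\mu$ is a pure (single-tree) distribution here, so by Gibbs' inequality the generating pair (T, P) scores at least as high as any other model distribution.

```python
import math
from itertools import product

def cfn(p):
    return [[1 - p, p], [p, 1 - p]]

def leaf_distribution(split, leaf_probs, internal_prob):
    """CFN leaf marginal on a 4-leaf tree; split = (leaves at u, leaves at v)."""
    side_u, side_v = split
    mats = [cfn(p) for p in leaf_probs]
    inner = cfn(internal_prob)
    mu = {}
    for labels in product(range(2), repeat=4):
        total = 0.0
        for lu in range(2):
            for lv in range(2):
                prob = 0.5 * inner[lu][lv]          # uniform stationary distribution
                for i in side_u:
                    prob *= mats[i][lu][labels[i]]
                for i in side_v:
                    prob *= mats[i][lv][labels[i]]
                total += prob
        mu[labels] = total
    return mu

def expected_log_likelihood(mu, model):
    """L_{T,P}(mu) = sum_y mu(y) ln(mu_{T,P}(y)), with `model` playing mu_{T,P}."""
    return sum(mu[y] * math.log(model[y]) for y in mu if mu[y] > 0)

splits = {"12|34": ((0, 1), (2, 3)), "13|24": ((0, 2), (1, 3)), "14|23": ((0, 3), (1, 2))}
params = ([0.1, 0.2, 0.15, 0.3], 0.05)   # illustrative edge parameters
mu = leaf_distribution(splits["12|34"], *params)
scores = {name: expected_log_likelihood(mu, leaf_distribution(s, *params))
          for name, s in splits.items()}
```

For a genuine mixture $\mu$ the same function applies unchanged; the results in this section concern what happens when $\mathcal{L}_{T,P}(\mu)$ is then maximized over P for each topology.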

Chang [3] constructed a mixture example where likelihood (maximized over the best 
single tree) was maximized on the wrong topology (i.e., different from the generating topol- 
ogy). In Chang's examples one tree had all edge weights sufficiently small (corresponding 
to invariant sites). We consider examples with less variation within the mixture and fewer 
parameters required to be sufficiently small. We consider (arguably more natural) examples 
of the same flavor as those studied by Kolaczkowski and Thornton [13], who showed experimentally
that in the JC model, likelihood appears to perform poorly on these examples.

Figure 1 shows the form of our examples where C and x are parameters of the example.
We consider a uniform mixture of the two trees in the figure. For each edge, the figure
shows the mutation probability, i.e., it is the off-diagonal entry for the transition matrix.
We consider the CFN, JC, K2 and K3 models.




Figure 1: In the binary CFN model, for all choices of C and x sufficiently small, maximum 
likelihood is inconsistent on a mixture of the above trees. 

We prove that in this mixture model, maximum likelihood is not robust in the following 
sense: when likelihood is maximized over the best single tree, the maximum likelihood 
topology is different from the generating topology. 

In our example all of the off-diagonal entries of the transition matrices are identical.
Hence for each edge we specify a single parameter and thus we define a set P of transition
matrices for a 4-leaf tree by a 5-dimensional vector where the i-th coordinate is the parameter
for the edge incident to leaf i, and the last coordinate is the parameter for the internal
edge.

Here is the statement of our result on the robustness of likelihood. 



Theorem 2. Let $C \in (0, 1/|\Omega|)$. Let $P_1 = (C + x, C - x, C - x, C + x, x^2)$ and
$P_2 = (C - x, C + x, C + x, C - x, x^2)$. Consider the following mixture distribution on $T_3$:

$$\mu_x = \left(\mu_{T_3,P_1} + \mu_{T_3,P_2}\right)/2.$$

1. In the CFN model, for all $C \in (0, 1/2)$, there exists $x_0 > 0$ such that for all $x \in (0, x_0)$
the maximum-likelihood tree for $\mu_x$ is $T_1$.

2. In the JC, K2 and K3 models, for C = 1/8, there exists $x_0 > 0$ such that for all
$x \in (0, x_0)$ the maximum-likelihood tree for $\mu_x$ is $T_1$.

Recall, likelihood is maximized over the best pure (i.e., non-mixture) distribution. 

Note, for the above theorem, we are maximizing the likelihood over assignments of valid
transition matrices for the model. For the above models, valid transition matrices are those
obtainable with finite and positive times $t_e$, and rate matrices $R_e$ where all the entries are
positive.

A key observation for our proof approach of Theorem 2 is that the two trees in the
mixture example are the same in the limit $x \to 0$. The x = 0 case is used in the proof for
the x > 0 case. We expect the above theorem holds for a more general class of examples
(such as arbitrary x, and any sufficiently small function on the internal edge), but our proof
approach requires x sufficiently small. Our proof approach builds upon the work of Mossel
and Vigoda [18].

Our results also extend to show that, for the CFN and JC models, MCMC methods using NNI
transitions converge exponentially slowly to the posterior distribution. This result requires
the 5-leaf version of the mixture example from Figure 1. We state our MCMC result formally in
Theorem |3U in Section |H] after presenting the background material. Previously, Mossel and
Vigoda [18] showed a mixture distribution where MCMC methods converge exponentially
slowly to the posterior distribution. However, in their example, the tree topology varies
between the two trees in the mixture.

1.4 Duality Theorem: Non-identifiability or Linear Tests

Based on the above results on the robustness of likelihood, we consider whether there are any 
methods which are guaranteed to determine the common topology for mixture examples. 
We first found that in the CFN model there is a simple mixture example of size 2, where 
the mixture distribution is non-identifiable. In particular, there is a mixture on topology 
$T_1$ and also a mixture on $T_3$ which generate identical distributions. Hence, it is impossible
to determine the correct topology in the worst case. It turns out that this example does 
not extend to models such as JC and K2. In fact, all mixtures in JC and K2 models are 
identifiable. This follows from our following duality theorem which distinguishes which 
models have non-identifiable mixture distributions, or have an easy method to determine 
the common topology in the mixture. 

We prove that for any model which is parameterized by an open set, either there exists
a linear test (which is a strictly separating hyperplane as defined shortly), or the model has
non-identifiable mixture distributions in the following sense: there exist a tree T, a
collection $P_1, \dots, P_k$, and a distribution $p_1, \dots, p_k$, together with another tree $T' \neq T$, a
collection $P'_1, \dots, P'_k$ and a distribution $p'_1, \dots, p'_k$, such that

$$\sum_{i=1}^{k} p_i \, \mu_{T,P_i} = \sum_{i=1}^{k} p'_i \, \mu_{T',P'_i}.$$

Thus in this case it is impossible to distinguish these two distributions. Hence, we cannot
even infer which of the topologies T or T' is correct. If the above holds, we say the model
has non-identifiable mixture distributions.

In contrast, when there is no non-identifiable mixture distribution we can use the following
notion of a linear test to reconstruct the topology. A linear test is a hyperplane strictly
separating distributions arising from two different 4 leaf trees (by symmetry the test can
be used to distinguish between the 3 possible 4 leaf trees). It suffices to consider trees with
4 leaves, since the full topology can be inferred from all 4 leaf subtrees (Bandelt and Dress
[2]).

Our duality theorem uses a geometric viewpoint (see Kim [12] for a nice introduction
to a geometric approach). Every mixture distribution $\mu$ on a 4-leaf tree T defines a point
$z \in \mathbb{R}^N$ where $N = |\Omega|^4$. For example, for the CFN model, we have $z = (z_1, \dots, z_{2^4})$ and
$z_1 = \mu(0000), z_2 = \mu(0001), z_3 = \mu(0010), \dots, z_{2^4} = \mu(1111)$. Let $C_i$ denote the set of
points corresponding to distributions $\mu(T_i, P)$ for the 4-leaf tree $T_i$, i = 1, 2, 3. A linear
test is a hyperplane which strictly separates the sets for a pair of trees.

Definition 3. Consider the 4-leaf trees $T_2$ and $T_3$. A linear test is a vector $t \in \mathbb{R}^{|\Omega|^4}$ such
that $t^T \mu_2 > 0$ for any mixture distribution $\mu_2$ arising from $T_2$ and $t^T \mu_3 < 0$ for any mixture
distribution $\mu_3$ arising from $T_3$.

There is nothing special about $T_2$ and $T_3$ - we can distinguish between mixtures arising
from any two 4 leaf trees, e.g., if t is a test then $t^{(13)}$ distinguishes the mixtures from $T_1$
and the mixtures from $T_2$, where (13) swaps the labels for leaves 1 and 3. More precisely,
for all $(a_1, a_2, a_3, a_4) \in \Omega^4$,

$$t^{(13)}_{a_1,a_2,a_3,a_4} = t_{a_3,a_2,a_1,a_4}. \qquad (2)$$


Theorem 4. For any model whose set M of transition matrices is parameterized by an 
open set (of multilinear polynomials), exactly one of the following holds: 

• there exist non-identifiable mixture distributions, or 

• there exists a linear test. 

For the JC and K2 models, the existence of a linear test follows immediately from Lake's
linear invariants [14]. Hence, our duality theorem implies that there are no non-identifiable
mixture distributions in these models. In contrast for the K3 model, we prove there is no
linear test, hence there is a non-identifiable mixture distribution. We also prove that in
the K3 model in the common rate matrix framework, there is a non-identifiable mixture
distribution.

To summarize, we show the following: 

Theorem 5. 

1. In the CFN model, there is an ambiguous mixture of size 2. 

2. In the JC and K2 models, there are no ambiguous mixtures.

3. In the K3 model there exists a non-identifiable mixture distribution (even in the common
rate matrix framework).
Steel, Szekely and Hendy j^2] previously proved the existence of a non-identifiable mix-
ture distribution in the CFN model, but their proof was non-constructive and gave no bound 
on the size of the mixture. Their result had the more appealing feature that the trees in 
the mixture were scalings of each other. 

Allman and Rhodes recently proved identifiability of the topology for certain classes 
of mixture distributions using invariants (not necessarily linear) . Rogers |2J proved that the 
topology is identifiable in the general time-reversible model when the rates vary according to 
what is known as the invariable sites plus gamma distribution model. Much of the current 
work on invariants uses ideas from algebraic geometry, whereas our notion of a linear test 
is natural from the perspective of convex programming duality. 

Note that even in models that do not have non-identifiability between different topologies,
there is non-identifiability within the topology. An interesting example was shown by
Evans and Warnow [£].

1.5 Outline of Paper 

We prove, in Section 3, Theorem 4, that a phylogenetic model has a non-identifiable mixture
distribution or a linear test. We then detail Lake's linear invariants in Section 5 and
conclude the existence of a linear test for the JC and K2 models. In Sections 6 and 7 we
prove that there are non-identifiable mixtures in the CFN and K3 models, respectively. We
also present a linear test for a restricted version of the CFN model in Section 6.3. We
prove the maximum likelihood results stated in Theorem 2 in Section |HJ. The maximum
likelihood results require several technical tools which are also proved in Section |H1. The
MCMC results are then stated formally and proved in Sectional

2 Preliminaries 

Let the permutation group $S_4$ act on the 4-leaf trees $\{T_1, T_2, T_3\}$ by renaming the leaves.
For example $(14) \in S_4$ swaps $T_2$ and $T_3$, and fixes $T_1$. For $\pi \in S_n$, we let $T^\pi$ denote tree
T permuted by $\pi$. It is easily checked that the following group K (Klein group) fixes every
$T_i$:

$$K = \{(), (12)(34), (13)(24), (14)(23)\} \leq S_4. \qquad (3)$$

Note that $K \trianglelefteq S_4$, i.e., K is a normal subgroup of $S_4$.

For weighted trees we let $S_4$ act on (T, P) by changing the labels of the leaves but
leaving the weights of the edges untouched. Let $\pi \in S_n$ and let $T' = T^\pi$. Note that the
distribution $\mu_{T',w}$ is just a permutation of the distribution $\mu_{T,w}$:

$$\mu_{T',P}(a_{\pi_1}, \dots, a_{\pi_n}) = \mu_{T,P}(a_1, \dots, a_n). \qquad (4)$$
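The group-theoretic claims above are easy to check mechanically. In the sketch below a 4-leaf tree is represented by the split of its leaves induced by the internal edge; the assignment $T_1 = 14|23$, $T_2 = 13|24$, $T_3 = 12|34$ is an assumption inferred from the statements in the text (the figure is not reproduced here), and all function names are hypothetical.

```python
from itertools import permutations

LEAVES = (1, 2, 3, 4)

def perm_from_cycles(*cycles):
    """Build a permutation of the leaves (as a dict) from disjoint cycles."""
    image = {leaf: leaf for leaf in LEAVES}
    for cyc in cycles:
        for pos, leaf in enumerate(cyc):
            image[leaf] = cyc[(pos + 1) % len(cyc)]
    return image

def split(a, b, c, d):
    """4-leaf tree with cherries {a, b} and {c, d}."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})

def relabel(perm, tree):
    """The tree obtained by renaming every leaf x to perm[x]."""
    return frozenset(frozenset(perm[leaf] for leaf in side) for side in tree)

def compose(p, q):
    """Apply q first, then p."""
    return {leaf: p[q[leaf]] for leaf in LEAVES}

def inverse(p):
    return {v: k for k, v in p.items()}

# Assumed labeling of the three trees.
T1, T2, T3 = split(1, 4, 2, 3), split(1, 3, 2, 4), split(1, 2, 3, 4)

K = [perm_from_cycles(),
     perm_from_cycles((1, 2), (3, 4)),
     perm_from_cycles((1, 3), (2, 4)),
     perm_from_cycles((1, 4), (2, 3))]

swap14 = perm_from_cycles((1, 4))

k_fixes_all = all(relabel(k, t) == t for k in K for t in (T1, T2, T3))
k_is_normal = all(
    compose(s, compose(k, inverse(s))) in K
    for sigma in permutations(LEAVES)
    for s in [dict(zip(LEAVES, sigma))]
    for k in K
)
```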
The actions on weighted trees and on distributions are compatible:




3 Duality Theorem 

In this section we prove the duality theorem (i.e., Theorem 4).

Our assumption that the transition matrices of the model are parameterized by an open
set implies the following observation.

Observation 6. For models parameterized by an open set, the coordinates of $\mu_{T,w}$ are
multi-linear polynomials in the parameters.

We now state a classical result that allows one to reduce the reconstruction problem to
trees with 4 leaves. Note there are three distinct leaf-labeled binary trees with 4 leaves. We
will call them $T_1$, $T_2$, $T_3$; see Figure 2.

For a tree T and a set S of leaves, let $T|_S$ denote the induced subgraph of T on S where
internal vertices of degree 2 are removed.

Theorem 7 (Bandelt and Dress [2]). For distinct leaf-labeled binary trees T and T'
there exists a set S of 4 leaves where $T|_S \neq T'|_S$. Hence, the set of induced subgraphs on all
4-tuples of leaves determines a tree.

The above theorem also simplifies the search for non-identifiable mixture distributions. 

Corollary 8. If there exists a non-identifiable mixture distribution then there exists a non- 
identifiable mixture distribution on trees with 4 leaves. 

Recall from the Introduction, the mixtures arising from $T_i$ form a convex set in the
space of joint distributions on leaf labelings. A test is a hyperplane strictly separating the
mixtures arising from $T_2$ and the mixtures arising from $T_3$. For general disjoint convex
sets a strictly separating hyperplane need not exist (e.g., take $C_1 = \{(0,0)\}$ and
$C_2 = \{(0, y) \mid y > 0\} \cup \{(x, y) \mid x > 0\}$). The sets of mixtures are special - they are convex hulls
of images of open sets under a multi-linear polynomial map.




Figure 2: All leaf-labeled binary trees with n = 4 leaves. 
Lemma 9. Let $p_1(x_1, \dots, x_n), \dots, p_m(x_1, \dots, x_n)$ be multi-linear polynomials in $x_1, \dots, x_n$.
Let $p = (p_1, \dots, p_m)$. Let $D \subseteq \mathbb{R}^n$ be an open set. Let C be the convex hull of $\{p(x) \mid x \in D\}$.
Assume that $0 \notin C$. There exists $s \in \mathbb{R}^m$ such that $s^T p(x) > 0$ for all $x \in D$.

Proof. Suppose the polynomial $p_i$ is a linear combination of the other polynomials, i.e.,
$p_i(x) = \sum_{j \neq i} c_j p_j(x)$. Let $p' = (p_1, \dots, p_{i-1}, p_{i+1}, \dots, p_m)$. Let C' be the convex hull of
$\{p'(x) \mid x \in D\}$. Then

$$(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_m) \mapsto \Big(y_1, \dots, y_{i-1}, \sum_{j \neq i} c_j y_j, y_{i+1}, \dots, y_m\Big)$$

is a bijection between points in C' and C. Note, $0 \notin C'$. There exists a strictly separating
hyperplane between 0 and C (i.e., there exists $s \in \mathbb{R}^m$ such that $s^T p(x) > 0$ for all $x \in D$) if
and only if there exists a strictly separating hyperplane between 0 and C' (i.e., there exists
$s' \in \mathbb{R}^{m-1}$ such that $s'^T p'(x) > 0$ for all $x \in D$). Hence, without loss of generality, we can
assume that the polynomials $p_1, \dots, p_m$ are linearly independent.

Since C is convex and $0 \notin C$, by the separating hyperplane theorem, there exists $s \neq 0$
such that $s^T p(x) \geq 0$ for all $x \in D$. If $s^T p(x) > 0$ for all $x \in D$ we are done.

Now suppose $s^T p(a) = 0$ for some $a \in D$. If $a \neq 0$, then by translating D by $-a$
and changing the polynomials appropriately (namely, $p_j(x) := p_j(x + a)$), without loss of
generality, we can assume a = 0.

Let $r(x_1, \dots, x_n) = s^T p(x_1, \dots, x_n)$. Note that r is multi-linear and $r \neq 0$ because
$p_1, \dots, p_m$ are linearly independent. Since a = 0, we have r(0) = 0 and hence r has no
constant monomial.

Let w be the monomial of lowest total degree which has a non-zero coefficient in r.
Consider $y = (y_1, \dots, y_n)$ where $y_j \neq 0$ for $x_j$ which occur in w and $y_j = 0$ for all other $x_j$.
Then, r(y) = w(y) since there are no monomials of smaller degree, and any other monomials
contain some $y_j$ which is 0. Hence by choosing y sufficiently close to 0, we have $y \in D$ (since
D is open) and r(y) < 0 (by choosing an appropriate direction for y). This contradicts the
assumption that s is a separating hyperplane. Hence r = 0 which is a contradiction with
the linear independence of the polynomials $p_1, \dots, p_m$. □
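The last step of the proof, finding a point y near 0 with r(y) < 0, can be made concrete. In this sketch a multi-linear polynomial without constant term is stored as a dictionary from sets of variable indices to coefficients; this representation, the function names, and the example polynomial are all assumptions chosen for illustration.

```python
import math

def evaluate(r, y):
    """Evaluate a multi-linear polynomial given as {frozenset(variables): coefficient}."""
    return sum(c * math.prod(y[j] for j in mono) for mono, c in r.items())

def negative_point(r, n, eps):
    """Return y with |y_j| <= eps and r(y) < 0, following the proof of Lemma 9.

    Assumes r is non-zero and has no constant monomial.
    """
    w = min((mono for mono, c in r.items() if c != 0), key=len)
    y = [0.0] * n
    for j in w:
        y[j] = eps                     # only the variables of w are non-zero
    if r[w] > 0:
        y[min(w)] = -eps               # flip one sign so that w(y) < 0
    return y

# Example: r(x) = 2 x0 x1 - 3 x0 x1 x2 + x2 x3 (n = 4 variables).
r = {frozenset({0, 1}): 2.0, frozenset({0, 1, 2}): -3.0, frozenset({2, 3}): 1.0}
eps = 1e-3
y = negative_point(r, 4, eps)
value = evaluate(r, y)
```

Every monomial other than w contains a variable outside w, so it vanishes at y and only the lowest-degree term contributes.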

We now prove our duality theorem. 

Proof of Theorem 4. Clearly there cannot exist both a non-identifiable mixture and a linear
test. Let $C_i$ be the convex set of mixtures arising from $T_i$ (for i = 2, 3). Assume that
$C_2 \cap C_3 = \emptyset$, i.e., there is no non-identifiable mixture in $\mathcal{M}$. Let $C = C_2 - C_3$. Note that
C is convex, $0 \notin C$, and C is the convex hull of an image of open sets under multi-linear
polynomial maps (by Observation 6). By Lemma 9 there exists $s \neq 0$ such that $s^T x > 0$
for all $x \in C$. Let $t = s - s^{(14)}$ (where $s^{(14)}$ is defined as in (2)). Let $\mu_2 \in C_2$ and let
$\mu_3 = \mu_2^{(14)}$ where $\mu_2^{(14)}$ is defined analogously to $s^{(14)}$. Then

$$t^T \mu_2 = (s - s^{(14)})^T \mu_2 = s^T \mu_2 - s^T \mu_3 = s^T(\mu_2 - \mu_3) > 0.$$

Similarly for $\mu_3 \in C_3$ we have $t^T \mu_3 < 0$ and hence t is a test. □

4 Simplifying the search for a linear test 

The CFN, JC, K2 and K3 models all have a natural group-theoretic structure. We show 
some key properties of linear tests utilizing this structure. These properties will simplify 
proofs of the existence of linear tests in JC and K2 (and restricted CFN) models and will 
also be used in the proof of the non-existence of linear tests in K3 model. Our main objective 
is to use symmetry inherent in the phylogeny setting to drastically reduce the dimension of 
the space of linear tests. 

Symmetric phylogeny models have a group of symmetries $G \leq S_\Omega$ (G is the intersection
of the automorphism groups of the weighted graphs corresponding to the matrices in $\mathcal{M}$).
The probability of a vertex labeling of T does not change if the labels of the vertices are
permuted by an element of G. Thus the elements of $\Omega^n$ which are in the same orbit of the
action of G on $\Omega^n$ have the same probability in any distribution arising from the model.

Let O' be the orbits of $\Omega^4$ under the action of G. Let O be the orbits of O' under the
action of K. Note that the action of (14) on O is well defined (because K is a normal
subgroup of $S_4$). For each pair $o_1, o_2 \in O$ that are swapped by (14) let

$$t_{o_1,o_2}(\mu) = \sum_{a \in o_1} \mu(a) - \sum_{a \in o_2} \mu(a). \qquad (5)$$

Lemma 10. Suppose that $\mathcal{M}$ has a linear test s. Then $\mathcal{M}$ has a linear test t which is a
linear combination of the $t_{o_1,o_2}$.

Proof. Let s be a linear test. Let

$$t' = \sum_{g \in K} s^g \quad \text{and} \quad t = t' - t'^{(14)}.$$

Let $\mu_2$ be a mixture arising from $T_2$. For any $g \in K$ the mixture $\mu_2^g$ arises from $T_2$ and
hence

$$(t')^T \mu_2 = \sum_{g \in K} (s^g)^T \mu_2 = \sum_{g \in K} s^T (\mu_2^g) > 0.$$

Similarly $(t')^T \mu_3 < 0$ for $\mu_3$ arising from $T_3$ and hence t' is a linear test.

Now we show that t is a linear test as well. Let $\mu_2$ arise from $T_2$. Note that $\mu_3 = \mu_2^{(14)}$
arises from $T_3$ and hence

$$t^T \mu_2 = (t' - t'^{(14)})^T \mu_2 = (t')^T \mu_2 - (t')^T \mu_3 > 0.$$

Similarly $t^T \mu_3 < 0$ for $\mu_3$ arising from $T_3$ and hence t is a linear test.

Note that t is zero on orbits fixed by (14). On orbits $o_1, o_2$ swapped by (14) we have
that t has opposite value (i.e., a on $o_1$, and $-a$ on $o_2$ for some a). Hence t is a linear
combination of the $t_{o_1,o_2}$. □

4.1 A simple condition for a linear test 

For the later proofs it will be convenient to label the edges by matrices which are not allowed
by the phylogenetic models. For example the identity matrix I (which corresponds to a zero
length edge) is an invalid transition matrix, i.e., $I \notin \mathcal{M}$, for the models considered in this
paper.

The definition (1) is continuous in the entries of the matrices and hence for a weighting
by matrices in $\mathrm{cl}(\mathcal{M})$ (the closure of $\mathcal{M}$) the generated distribution is arbitrarily close to a
distribution generated from the model.

Observation 11. A linear test for $\mathcal{M}$ (which is a strictly separating hyperplane for $\mathcal{M}$) is
a separating hyperplane for $\mathrm{cl}(\mathcal{M})$.

The above observation follows from the fact that if a continuous function $f : \mathrm{cl}(A) \to \mathbb{R}$
is positive on some set A then it is non-negative on cl(A).

Suppose that the identity matrix $I \in \mathrm{cl}(\mathcal{M})$. Let $\mu$ arise from $T_2$ with weights such that
the internal edge has weight I. Then $\mu$ arises also from $T_3$ with the same weights. A linear
test has to be positive for mixtures from $T_2$ and negative for mixtures from $T_3$, so by
Observation 11 it is both non-negative and non-positive on $\mu$. Hence we
have:

Observation 12. Let $\mu$ arise from $T_2$ with weights such that the internal edge has transition
matrix I. Let t be a linear test. Then $t^T \mu = 0$.

5 Linear tests for JC and K2 

In this section we show a linear test for JC and K2 models. In fact we show that the linear
invariants introduced by Lake [14] are linear tests. We expect that this fact is already
known, but we include the proof for completeness and since it is almost elementary given the
preliminaries from the previous section.






5.1 A linear test for the Jukes-Cantor model 



To simplify many of the upcoming expressions throughout the following section, we center
the transition matrix for the Jukes-Cantor (JC) model around its stationary distribution in
the following manner. Recall the JC model has $\Omega_{JC} = \{0, 1, 2, 3\}$ and its semigroup $\mathcal{M}_{JC}$
consists of matrices

$$M_{JC}(x) = \frac{1}{4} E + \begin{pmatrix} 3x & -x & -x & -x \\ -x & 3x & -x & -x \\ -x & -x & 3x & -x \\ -x & -x & -x & 3x \end{pmatrix},$$

where E is the all ones matrix (i.e., E(i, j) = 1 for all $0 \leq i, j < |\Omega|$) and 0 < x < 1/4.

We refer to x as the centered edge weight. Thus, a centered edge weight of x = 1/4 
(which is not valid) means both endpoints have the same assignment. Whereas x = (also 
not valid) means the endpoints are independent. 
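
The two boundary cases of the centered weight can be checked directly. The following is a minimal sketch in exact rational arithmetic; the helper name `jc_matrix` is illustrative and not from the paper.

```python
from fractions import Fraction as F

def jc_matrix(x):
    """Centered Jukes-Cantor matrix: (1/4)E plus 3x on the diagonal, -x elsewhere."""
    return [[F(1, 4) + (3 * x if i == j else -x) for j in range(4)]
            for i in range(4)]

identity = [[F(int(i == j)) for j in range(4)] for i in range(4)]
uniform = [[F(1, 4)] * 4 for _ in range(4)]

m = jc_matrix(F(1, 10))                 # a valid centered weight, 0 < x < 1/4
rows_sum_to_one = all(sum(row) == 1 for row in m)
at_quarter = jc_matrix(F(1, 4))         # boundary: endpoints always agree
at_zero = jc_matrix(F(0))               # boundary: endpoints independent
```

Here `at_quarter` is the identity matrix and `at_zero` has every entry equal to 1/4, matching the two interpretations given above.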

The group of symmetries of $\Omega_{JC}$ is $G_{JC} = S_4$. There are 15 orbits in $\Omega_{JC}^4$ under the
action of $G_{JC}$ (each orbit has a representative in which $i$ appears before $j$ for any $i < j$).
The action of $K$ further decreases the number of orbits to 9. Here we list the 9 orbits and
indicate which orbits are swapped by (14):

$$0000,\quad 0110,\quad 0123,\quad \{0111, 0100, 0010, 0001\},\quad \{0112, 0120\},\quad 0011 \leftrightarrow 0101,\quad \{0122, 0012\} \leftrightarrow \{0121, 0102\}. \tag{6}$$

By Lemma 10 every linear test in the JC model is a linear combination of
$t_1 = \mu(0011) - \mu(0101)$, and

$$t_2 = \mu(0122) - \mu(0121) - \mu(0102) + \mu(0012). \tag{7}$$

We will show that $t_2 - t_1$ is a linear test and that there exist no other linear tests (i.e.,
all linear tests are multiples of $t_2 - t_1$).

Lemma 13. Let $\mu$ be a single-tree mixture arising from a tree $T$ on 4 leaves. Let $t$ be
defined by

$$t^T\mu = t_2 - t_1 = \mu(0122) - \mu(0121) + \mu(0101) - \mu(0011) + \mu(0012) - \mu(0102). \tag{8}$$

Let $\mu_i$ arise from $T_i$, for $i = 1,2,3$. We have

$$t^T\mu_1 = 0, \quad t^T\mu_2 > 0, \quad \text{and} \quad t^T\mu_3 < 0.$$

In particular $t$ is a linear test.

Proof. Label the 4 leaves as $v_1, \dots, v_4$, and let $x_1, \dots, x_4$ denote the centered edge weight
of the edge incident to the respective leaf. Let $x_5$ denote the centered edge weight of the
internal edge.

Let $\mu_j$ arise from $T_j$ with centered edge weights $x_1, \dots, x_5$, $j \in \{1,2,3\}$. Let $\Phi_j$ be the
multi-linear polynomial $t^T\mu_j$. If $x_4 = 0$ then $\mu_j$ does not depend on the label of $v_4$ and
hence, for all $a \in \Omega_{JC}$,

$$\Phi_j = \mu_j(012a) - \mu_j(012a) + \mu_j(010a) - \mu_j(001a) + \mu_j(001a) - \mu_j(010a) = 0.$$

Thus $x_4$ divides $\Phi_j$. The $t_i$ are invariant under the action of $K$ (which is transitive on
1,2,3,4) and hence $\Phi_j$ is invariant under the action of $K$. Hence $x_i$ divides $\Phi_j$ for $i = 1, \dots, 4$. We have

$$\Phi_j = x_1x_2x_3x_4\,\ell_j(x_5),$$

where $\ell_j(x_5)$ is a linear polynomial in $x_5$.

Let $\mu'_1$ arise from $T_1$ with $x_1 = \dots = x_4 = 1/4$. In leaf-labelings with non-zero probability
in $\mu'_1$ the labels of $v_1, v_4$ agree and the labels of $v_2, v_3$ agree. None of the leaf-labelings
in (8) satisfy this requirement and hence $\Phi_1 = 0$ if $x_1 = \dots = x_4 = 1/4$. Hence $\ell_1(x_5)$ is the
zero polynomial and $\Phi_1$ is the zero polynomial as well.

Now we consider $T_2$. If $x_5 = 1/4$ then, by Observation 12, $\Phi_2 = 0$. Thus 1/4 is a root of
$\ell_2$ and hence $\Phi_2 = \alpha \cdot x_1x_2x_3x_4(1/4 - x_5)$. We plug in $x_5 = 0$ and $x_1 = x_2 = x_3 = x_4 = 1/4$
to determine $\alpha$. Let $\mu'_2$ be the distribution generated by these weights. The leaf-labelings
for which $\mu'_2$ is non-zero must have the same label for $v_1, v_3$ and the same label for $v_2, v_4$.
Thus $\Phi_2 = \mu'_2(0101) = 1/16$ and hence $\alpha = 64$. We have

$$\Phi_2 = 64\,x_1x_2x_3x_4(1/4 - x_5) = 16\,x_1x_2x_3x_4(1 - 4x_5).$$

Note that $\Phi_2$ is always positive. The action of (14) switches the signs of the $t_i$ and hence
$\Phi_3 = -\Phi_2$. Thus $\Phi_3$ is always negative. □
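
The closed form for $\Phi_2$ can be checked by brute force on a single choice of weights. The sketch below is illustrative only: the helpers `jc_matrix`, `quartet` and `apply_test` are hypothetical names, the weights are arbitrary valid values, and it assumes (as the proof indicates) that $T_1$, $T_2$, $T_3$ pair the leaves as 14|23, 13|24 and 12|34.

```python
from fractions import Fraction as F
from itertools import product

def jc_matrix(x):
    """Centered Jukes-Cantor matrix: (1/4)E plus 3x on the diagonal, -x elsewhere."""
    return [[F(1, 4) + (3 * x if i == j else -x) for j in range(4)]
            for i in range(4)]

def quartet(cherry, leaf_mats, inner):
    """Leaf distribution of a 4-leaf tree. `cherry` holds the two 0-based leaves
    attached to the first internal vertex; the other two hang off the second."""
    n = len(inner)
    mu = {}
    for labels in product(range(n), repeat=4):
        total = F(0)
        for u, w in product(range(n), repeat=2):   # labels of the internal vertices
            prob = F(1, n) * inner[u][w]
            for leaf, a in enumerate(labels):
                prob *= leaf_mats[leaf][u if leaf in cherry else w][a]
            total += prob
        mu[labels] = total
    return mu

# Coefficients of the test (8), keyed by leaf-labeling.
TEST = {"0122": 1, "0121": -1, "0101": 1, "0011": -1, "0012": 1, "0102": -1}

def apply_test(mu):
    return sum(c * mu[tuple(int(ch) for ch in key)] for key, c in TEST.items())

x = [F(1, 5), F(1, 8), F(1, 10), F(3, 20)]   # leaf edges, all in (0, 1/4)
x5 = F(1, 6)                                 # internal edge
leaf_mats = [jc_matrix(v) for v in x]
inner = jc_matrix(x5)
trees = {"T1": (0, 3), "T2": (0, 2), "T3": (0, 1)}
phi = {name: apply_test(quartet(ch, leaf_mats, inner)) for name, ch in trees.items()}
closed_form = 16 * x[0] * x[1] * x[2] * x[3] * (1 - 4 * x5)
```

Under these assumptions `phi["T1"]` vanishes, `phi["T2"]` agrees with `closed_form`, and `phi["T3"]` is its negative.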

We now show uniqueness of the above linear test, i.e., any other linear test is a multiple
of $t_2 - t_1$.

Lemma 14. Any linear test in the JC model is a multiple of (8).

Proof. Let $t = \alpha_1 t_1 + \alpha_2 t_2$ be a linear test. Let $\mu_1$ be the distribution generated by centered
weights $x_2 = x_4 = x_5 = 1/4$ and $x_1 = x_3 = 0$ on $T_2$. By Observation 12 we must have
$t^T\mu_1 = 0$. Note that

$$\mu_1(a_1a_2a_3a_4) = \begin{cases} 1/64 & \text{if } a_2 = a_4, \\ 0 & \text{otherwise.} \end{cases}$$

Hence

$$t^T\mu_1 = -\alpha_1\mu_1(0101) - \alpha_2\mu_1(0121) = -\tfrac{1}{64}(\alpha_1 + \alpha_2) = 0.$$

Thus $\alpha_1 = -\alpha_2$ and hence $t$ is a multiple of (8). □

5.2 A linear test for Kimura's 2-parameter model

Mutations between two purines (A and G) or between two pyrimidines (C or T) are more 
likely than mutations between a purine and a pyrimidine. Kimura's 2-parameter model 
(K2) tries to model this fact. 

We once again center the transition matrices to simplify the calculations. The K2 model
has $\Omega_{K2} = \{0,1,2,3\}$ and its semigroup $\mathcal{M}_{K2}$ consists of matrices

$$M_{K2}(x,y) = \frac{1}{4}E + \begin{pmatrix} x+2y & -x & -y & -y \\ -x & x+2y & -y & -y \\ -y & -y & x+2y & -x \\ -y & -y & -x & x+2y \end{pmatrix},$$

with $x \le y < 1/4$ and $x + y > 0$. See Felsenstein for the closed form of the transition
matrices of the model in terms of the times $t_e$ and rate matrices $R_e$. One can then derive
the equivalence of the conditions there with the conditions $x \le y < 1/4$, $x + y > 0$.

Note, $x$ can be negative, and hence certain transitions can have probability $> 1/4$ but
are always $< 1/2$. Observe that $M_{K2}(x,x) = M_{JC}(x)$, i.e., the JC model is a special case
of the K2 model.

The group of symmetries is $G_{K2} = \langle (01), (02)(13) \rangle$ (it has 8 elements). There are 36
orbits in $\Omega^4$ under the action of $G_{K2}$ (each orbit has a representative in which 0 appears
first and 2 appears before 3). The action of $K$ further decreases the number of orbits to 18.
The following orbits are fixed by (14):

$$0000,\quad 0110,\quad 0220,\quad 0231,\quad \{0111, 0100, 0010, 0001\},\quad \{0222, 0200, 0020, 0002\},\quad \{0223, 0210, 0120, 0112\},\quad \{0221, 0230\}.$$

The following orbits are swapped by (14):

$$0011 \leftrightarrow 0101,\quad 0022 \leftrightarrow 0202,\quad 0123 \leftrightarrow 0213,\quad \{0233, 0211, 0021, 0012\} \leftrightarrow \{0232, 0201, 0121, 0102\},\quad \{0122, 0023\} \leftrightarrow \{0212, 0203\}.$$

By Lemma 10 any linear test for the K2 model is a linear combination of

$$\begin{aligned}
t_1 &= \mu(0011) - \mu(0101), \\
t_2 &= \mu(0233) + \mu(0211) - \mu(0232) - \mu(0201) - \mu(0121) + \mu(0021) - \mu(0102) + \mu(0012), \\
t_3 &= \mu(0022) - \mu(0202), \\
t_4 &= \mu(0122) + \mu(0023) - \mu(0212) - \mu(0203), \\
t_5 &= \mu(0123) - \mu(0213).
\end{aligned}$$

Lemma 15. Let $\mu$ be a single-tree mixture arising from a tree $T$ on 4 leaves. Let $t$ be
defined by

$$t^T\mu = \mu(0122) - \mu(0212) + \mu(0202) - \mu(0022) + \mu(0023) - \mu(0203) + \mu(0213) - \mu(0123). \tag{9}$$

Let $\mu_i$ arise from $T_i$, for $i = 1,2,3$. We have

$$t^T\mu_1 = 0, \quad t^T\mu_2 > 0, \quad \text{and} \quad t^T\mu_3 < 0.$$

In particular $t$ is a linear test.

Proof. Let $T = T_j$ for some $j \in \{1,2,3\}$. Let the transition matrix of the edge incident
to leaf $v_i$ be $M_{K2}(x_i,y_i)$, and let the internal edge have $M_{K2}(x_5,y_5)$. Let $\mu_j$ be the generated
distribution, and let $\Phi_j$ be the multi-linear polynomial $t^T\mu_j$.

If $y_4 = -x_4$ then the matrix on the edge incident to leaf $v_4$ has the last two columns
the same. Hence, roughly speaking, this edge forgets the distinction between labels 2 and 3,
and therefore, in $\Phi_j$, we can do the following replacements:

0122 → 0123
0202 → 0203
0022 → 0023
0213 → 0212,

and we obtain,

$$\Phi_j = 0. \tag{10}$$

Thus $x_4 + y_4$ divides $\Phi_j$. Since $\Phi_j$ is invariant under the action of $K$ we have that $x_i + y_i$
divides $\Phi_j$ for $i = 1, \dots, 4$ and hence

$$\Phi_j = (x_1 + y_1) \cdots (x_4 + y_4)\,\psi_j(x_5,y_5), \tag{11}$$

where $\psi_j(x_5,y_5)$ is linear in $x_5$ and $y_5$.

Now let $x_i = y_i = 1/4$ for $i = 1, \dots, 4$. The labels of the internal vertices $v_j$ for $j = 5, 6$
must agree with the labels of neighboring leaves and hence

$$\Phi_j = \mu_j(0202) - \mu_j(0022). \tag{12}$$

Now plugging $j = 1$ into (11) and (12), for this setting of $x_i, y_i$, we have

$$\Phi_1 = 0. \tag{13}$$

For this setting the product in (11) equals 1/16 and $\mu_2(0202) = \frac14(\frac14 - y_5)$. By plugging $j = 2, 3$ into (11) and (12) we therefore have

$$\Phi_2 = -\Phi_3 = (x_1 + y_1)(x_2 + y_2)(x_3 + y_3)(x_4 + y_4)(1 - 4y_5). \tag{14}$$

Note that $x_i + y_i > 0$ and $y_5 < 1/4$ and hence (14) is always positive. Linearity of the test
$\mu \mapsto t^T\mu$ implies that $t^T\mu$ is positive for any mixture generated from $T_2$ and negative for
any mixture generated from $T_3$. □
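
As a sanity check, the test (9) can be evaluated by brute force on one K2 quartet and compared with the product $(x_1+y_1)(x_2+y_2)(x_3+y_3)(x_4+y_4)(1-4y_5)$, which is the form forced by the divisibility argument (the test vanishes when $y_4 = -x_4$) together with (12). This is a sketch under stated assumptions: `k2_matrix`, `quartet` and `apply_test` are hypothetical helper names, the weights are arbitrary valid values, and $T_1$, $T_2$, $T_3$ are taken to pair the leaves as 14|23, 13|24 and 12|34.

```python
from fractions import Fraction as F
from itertools import product

def k2_matrix(x, y):
    """Centered K2 matrix (1/4)E + C(x, y), with -x inside {0,1} and {2,3}, -y across."""
    d = x + 2 * y
    c = [[d, -x, -y, -y], [-x, d, -y, -y], [-y, -y, d, -x], [-y, -y, -x, d]]
    return [[F(1, 4) + c[i][j] for j in range(4)] for i in range(4)]

def quartet(cherry, leaf_mats, inner):
    """Leaf distribution of a 4-leaf tree; `cherry` = the two 0-based leaves
    attached to the first internal vertex."""
    n = len(inner)
    mu = {}
    for labels in product(range(n), repeat=4):
        total = F(0)
        for u, w in product(range(n), repeat=2):
            prob = F(1, n) * inner[u][w]
            for leaf, a in enumerate(labels):
                prob *= leaf_mats[leaf][u if leaf in cherry else w][a]
            total += prob
        mu[labels] = total
    return mu

# Coefficients of the test (9).
TEST = {"0122": 1, "0212": -1, "0202": 1, "0022": -1,
        "0023": 1, "0203": -1, "0213": 1, "0123": -1}

def apply_test(mu):
    return sum(c * mu[tuple(int(ch) for ch in key)] for key, c in TEST.items())

# Arbitrary valid weights: x <= y < 1/4 and x + y > 0 on every edge.
leaves = [(F(1, 10), F(1, 5)), (F(-1, 20), F(1, 8)), (F(1, 8), F(1, 8)), (F(0), F(3, 20))]
x5, y5 = F(1, 12), F(1, 6)
leaf_mats = [k2_matrix(x, y) for x, y in leaves]
inner = k2_matrix(x5, y5)
trees = {"T1": (0, 3), "T2": (0, 2), "T3": (0, 1)}
phi = {name: apply_test(quartet(ch, leaf_mats, inner)) for name, ch in trees.items()}

closed_form = 1 - 4 * y5
for x, y in leaves:
    closed_form *= x + y
```

With $x_i = y_i$ on every edge the product reduces to $16x_1x_2x_3x_4(1-4x_5)$, the value of the JC test on $T_2$.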

Lemma 16. Any linear test in the K2 model is a multiple of (9).

Proof. Let $t = \alpha_1 t_1 + \dots + \alpha_5 t_5$ be a linear test. A linear test in the K2 model must work
for the JC model as well. Applying symmetries $G_{JC}$ we obtain

$$\begin{aligned}
t = {} & \alpha_1(\mu(0011) - \mu(0101)) + 2\alpha_2(\mu(0122) - \mu(0121) - \mu(0102) + \mu(0012)) \\
& + \alpha_3(\mu(0011) - \mu(0101)) + \alpha_4(\mu(0122) + \mu(0012) - \mu(0121) - \mu(0102)).
\end{aligned} \tag{15}$$

Comparing (15) with (8) we obtain

$$\alpha_1 + \alpha_3 = -1 \quad \text{and} \quad \alpha_4 + 2\alpha_2 = 1. \tag{16}$$

Let $\mu_1$ arise from $T_2$ with centered weights $(x_2,y_2) = (x_4,y_4) = (x_5,y_5) = (1/4,1/4)$,
$(x_1,y_1) = (0,1/4)$, and $(x_3,y_3) = (x,y)$. From Observation 12 it follows that $t^T\mu_1 = 0$. The
leaf-labelings with non-zero probability must give the same label to $v_2$ and $v_4$, and the labels
of $v_1$ and $v_2$ must either be both in $\{0,1\}$ or both in $\{2,3\}$. The only such leaf-labelings
involved in $t_1, \dots, t_5$ are 0101, 0121. Thus

$$t^T\mu_1 = -\alpha_1\mu_1(0101) - \alpha_2\mu_1(0121) = -\tfrac{1}{16}\left(\alpha_1(\tfrac14 - x) + \alpha_2(\tfrac14 - y)\right) = 0. \tag{17}$$

Since (17) must hold for all valid $(x,y)$, we get $\alpha_1 = \alpha_2 = 0$ and from (16) we get $\alpha_3 = -1$ and $\alpha_4 = 1$.

Let $\mu_2$ be generated from $T_2$ with centered weights $(x_4,y_4) = (x_5,y_5) = (1/4,1/4)$,
$(x_1,y_1) = (x_3,y_3) = (0,0)$, and $(x_2,y_2) = (0,1/4)$. In leaf-labelings with non-zero probability
the labels of $v_2$ and $v_4$ are either both in $\{0,1\}$ or both in $\{2,3\}$. The only such leaf-labelings
in $t_3, t_4, t_5$ are 0202, 0213, 0212, and 0203. Hence

$$t^T\mu_2 = \mu_2(0202) - \mu_2(0212) - \mu_2(0203) - \alpha_5\mu_2(0213) = \tfrac{1}{256}(3 - 3 - 1 - \alpha_5) = 0.$$

Thus $\alpha_5 = -1$ and all the $\alpha_i$ are determined. Hence the linear test is unique (up to scalar
multiplication). □

6 Non-identifiability and linear tests in CFN



In this section we consider the CFN model. We first prove there is no linear test, and then 
we present a non-identifiable mixture distribution. We then show that there is a linear test 
for the CFN model when the edge probabilities are restricted to some interval. 



6.1 No linear test for CFN 

Again, when considering linear tests we look at the model with its transition matrix centered
around its stationary distribution. The CFN model has $\Omega_{CFN} = \{0,1\}$ and its semigroup
$\mathcal{M}_{CFN}$ consists of matrices

$$M_{CFN}(x) = \frac{1}{2}E + \begin{pmatrix} x & -x \\ -x & x \end{pmatrix}$$

with $0 < x < 1/2$.

In the CFN model, note that the roles of 0 and 1 are symmetric, i.e.,

$$\mu_{T,w}(a_1, \dots, a_n) = \mu_{T,w}(1 - a_1, \dots, 1 - a_n). \tag{18}$$

Hence the group of symmetries of $\Omega_{CFN}$ is $G_{CFN} = \langle (01) \rangle = \mathbb{Z}/(2\mathbb{Z})$. There are 8 orbits
of the action of $G_{CFN}$ on $\Omega_{CFN}^4$ (one can choose a representative for each orbit to have the
first coordinate 0). The action of $K$ further reduces the number of orbits to 5. The action
of (14) swaps two of the orbits and keeps 3 of the orbits fixed:

$$0000,\quad 0110,\quad \{0111, 0100, 0010, 0001\},\quad 0011 \leftrightarrow 0101.$$

By Lemma 10 if there exists a linear test for CFN then (a multiple of) $t_1 = \mu(0011) - \mu(0101)$
is a linear test. Let $\mu$ arise from $T_2$ with the edge incident to leaf $v_i$ labeled by $M_{CFN}(x_i)$,
for $i = 1, \dots, 4$, and the internal edge labeled by $M_{CFN}(x_5)$. A short calculation yields

$$\mu(0011) - \mu(0101) = x_5(x_1x_3 + x_2x_4) - \frac{x_1x_2 + x_3x_4}{2}. \tag{19}$$

Note that (19) is negative if $x_5$ is much smaller than the other $x_i$; and (19) is positive if
$x_1, x_3$ are much smaller than the other $x_i$. Thus $t_1$ is not a linear test and hence there
does not exist a linear test in the CFN model. By the duality theorem there exists a non-identifiable
mixture. The next result gives an explicit family of non-identifiable mixtures.

6.2 Non-identifiable Mixture for CFN



For each edge $e$ we will give the edge probability $0 < w_e < 1/2$, which is the probability
the endpoints receive different assignments (i.e., it is the off-diagonal entry in the transition
matrix). For a 4-leaf tree $T$, we specify a set of transition matrices for the edges by a
5-dimensional vector $P = (w_1, w_2, w_3, w_4, w_5)$ where, for $1 \le i \le 4$, $w_i$ is the edge probability
for the edge incident to leaf labeled $i$, and $w_5$ is the edge probability for the internal edge.

Proposition 17. For $0 < a, b < 1/2$ and $0 < p \le 1/2$, set

$$P_1 = \tfrac{1}{2}\mathbf{1} - (a, b, a, b, c) \quad \text{and} \quad P_2 = \tfrac{1}{2}\mathbf{1} - (b, a, b, a, d),$$

where $\mathbf{1} = (1,1,1,1,1)$ and

$$c = z/p, \qquad d = z/(1 - p), \qquad z = \frac{ab}{2(a^2 + b^2)}.$$

Let

$$\mu = p\,\mu_{T_3,P_1} + (1 - p)\,\mu_{T_3,P_2}.$$

The distribution $\mu$ is invariant under $\pi = (14)$. Hence, $\mu$ is also generated by a mixture
from $T^\pi$, a leaf-labeled tree different from $T = T_3$. In particular, the following holds
(with the leaf edge weights carried along with their leaves by $\pi$):

$$\mu = p\,\mu_{T_2,P_1} + (1 - p)\,\mu_{T_2,P_2}.$$

Hence, whenever $c$ and $d$ satisfy $0 < c, d < 1/2$ then $\mu$ is in fact a distribution and there
is non-identifiability. Note, for every $0 < p \le 1/2$, there exist $a$ and $b$ which define a
non-identifiable mixture distribution.
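
The proposition can be checked numerically for one parameter choice. The sketch below is illustrative: `cfn`, `quartet`, `tree_dist`, `mix` and `swap14` are hypothetical helper names, the values $a = 2/5$, $b = 1/10$, $p = 2/5$ are an arbitrary valid choice, and it assumes $T_3$ and $T_2$ pair the leaves as 12|34 and 13|24.

```python
from fractions import Fraction as F
from itertools import product

def cfn(w):
    """CFN transition matrix with edge probability w (endpoints differ w.p. w)."""
    return [[1 - w, w], [w, 1 - w]]

def quartet(cherry, leaf_mats, inner):
    """Leaf distribution of a 4-leaf tree; `cherry` = the two 0-based leaves
    attached to the first internal vertex."""
    n = len(inner)
    mu = {}
    for labels in product(range(n), repeat=4):
        total = F(0)
        for u, w in product(range(n), repeat=2):
            prob = F(1, n) * inner[u][w]
            for leaf, a in enumerate(labels):
                prob *= leaf_mats[leaf][u if leaf in cherry else w][a]
            total += prob
        mu[labels] = total
    return mu

a, b, p = F(2, 5), F(1, 10), F(2, 5)
z = a * b / (2 * (a * a + b * b))
c, d = z / p, z / (1 - p)
P1 = [F(1, 2) - v for v in (a, b, a, b, c)]
P2 = [F(1, 2) - v for v in (b, a, b, a, d)]

def tree_dist(cherry, weights):
    return quartet(cherry, [cfn(w) for w in weights[:4]], cfn(weights[4]))

def mix(m1, m2):
    return {k: p * m1[k] + (1 - p) * m2[k] for k in m1}

def swap14(weights):
    """Carry the leaf edge weights along with leaves 1 and 4 when they are swapped."""
    w1, w2, w3, w4, w5 = weights
    return [w4, w2, w3, w1, w5]

mu = mix(tree_dist((0, 1), P1), tree_dist((0, 1), P2))                      # from T3
mu_other = mix(tree_dist((0, 2), swap14(P1)), tree_dist((0, 2), swap14(P2)))  # from T2
invariant = all(mu[(q, r, s, t)] == mu[(t, r, s, q)] for (q, r, s, t) in mu)
```

For this choice all ten edge probabilities lie in $(0, 1/2)$, the mixture is invariant under swapping leaves 1 and 4, and the two mixtures on different topologies coincide exactly.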

Proof. Note that $\pi = (14)$ fixes leaf labels 0000, 0010, 0100, 0110 and swaps 0011 with 0101
and 0001 with 0111.

A short calculation yields

$$\begin{aligned}
\mu(0011) - \mu(0101) &= ab - (a^2 + b^2)(pc + (1 - p)d), \text{ and} \\
\mu(0001) - \mu(0111) &= (a^2 - b^2)(pc - (1 - p)d),
\end{aligned}$$

which are both zero for our choice of $c$ and $d$. This implies that $\mu$ is invariant under the
action of (14), and hence non-identifiable. □

6.3 Linear test for CFN with restricted weights

Lemma 18. Let $a \in (0, 1/2)$. If the centered edge weight $x$ for the CFN model is restricted
to the interval $(a, \sqrt{a - a^2})$ then there is a linear test.

Proof. We will show that the negation of (19), i.e., $\mu(0101) - \mu(0011)$, is positive if the $x_i$ are in the interval $(a, \sqrt{a - a^2})$. Let
$b = \sqrt{a - a^2}$. Note that $0 < a < b < 1/2$.

Since (19) is multi-linear, its extrema occur when the $x_i$ are from the set $\{a, b\}$ (we
call such a setting of the $x_i$ extremal). Note that the $x_i$ are positive and, in the negation of (19), $x_5$ occurs only in
terms with negative sign. Thus a minimum occurs for $x_5 = b$. The only extremal settings
of the $x_i$ which have $x_1x_3 + x_2x_4 > x_1x_2 + x_3x_4$ are $x_1 = x_3 = b$, $x_2 = x_4 = a$ and
$x_1 = x_3 = a$, $x_2 = x_4 = b$. For the other extremal settings the negation of (19) is positive, since $b < 1/2$.
For $x_1 = x_3 = b$, $x_2 = x_4 = a$ its value is $b(a - (a^2 + b^2))$, which is zero because $b^2 = a - a^2$.
Hence the negation of (19) is non-negative at every extremal setting and, being multi-linear and not
identically zero, it is positive on the open interval. □

Remark 19. In contrast to the above lemma, it is known that there is no linear invariant 
for the CFN model. This implies that there is also no linear invariant for the restricted 
CFN model considered above, since such an invariant would then extend to the general 
model. This shows that the notion of linear test is more useful in some settings than linear 
invariants. 

7 Non-identifiability in K3 

In this section we prove there exists a non-identifiable mixture distribution in the K3 model.
Our result holds even when the rate matrix is the same for all edges in the tree (the edges
differ only by their associated time), i.e., the common rate matrix framework. Moreover,
we will show that for most rate matrices $R$ in the K3 model there exists a non-identifiable
mixture in which all transition matrices are generated from $R$.

Kimura's 3-parameter model (K3) has $\Omega_{K3} = \{0,1,2,3\}$ and its semigroup $\mathcal{M}_{K3}$
consists of matrices of the following form (which we have centered around their stationary
distribution):

$$M_{K3}(x,y,z) = \frac{1}{4}E + \begin{pmatrix} x+y+z & -x & -y & -z \\ -x & x+y+z & -z & -y \\ -y & -z & x+y+z & -x \\ -z & -y & -x & x+y+z \end{pmatrix},$$

with $x \le y \le z < 1/4$, $x + y > 0$, and $(x + y) > 2(x + z)(y + z)$. Note that $M_{K3}(x, y, y) = M_{K2}(x, y)$, i.e., the K2 model is a special case of the K3 model.

The group of symmetries is $G_{K3} = \langle (01)(23), (02)(13) \rangle$ (which is again the Klein group).
There are 64 orbits in $\Omega^4$ under the action of $G_{K3}$ (each orbit has a representative in which
0 appears first). The action of $K$ further decreases the number of orbits to 28. The following
orbits are fixed by (14):

$$0000,\quad 0110,\quad 0220,\quad 0330,$$

$$\{0331, 0320, 0230, 0221\},\quad \{0332, 0310, 0130, 0112\},\quad \{0223, 0210, 0120, 0113\},$$

$$\{0111, 0100, 0010, 0001\},\quad \{0222, 0200, 0020, 0002\},\quad \{0333, 0300, 0030, 0003\}.$$

The following orbits switch as indicated under the action of (14):

$$\{0322, 0311, 0021, 0012\} \leftrightarrow \{0232, 0201, 0131, 0102\},$$

$$\{0233, 0211, 0031, 0013\} \leftrightarrow \{0323, 0301, 0121, 0103\},$$

$$\{0133, 0122, 0032, 0023\} \leftrightarrow \{0313, 0302, 0212, 0203\},$$

and $0011 \leftrightarrow 0101$, $0022 \leftrightarrow 0202$, $0033 \leftrightarrow 0303$, $0123 \leftrightarrow 0213$, $0132 \leftrightarrow 0312$, $0231 \leftrightarrow 0321$.



7.1 No Linear Test for K3 

By Lemma 10 any test is a linear combination of

$$\begin{aligned}
t_1 &= \mu(0011) - \mu(0101), \\
t_2 &= \mu(0322) + \mu(0311) - \mu(0232) - \mu(0201) - \mu(0131) - \mu(0102) + \mu(0021) + \mu(0012), \\
t_3 &= \mu(0233) + \mu(0211) - \mu(0323) - \mu(0301) + \mu(0031) + \mu(0013) - \mu(0121) - \mu(0103), \\
t_4 &= \mu(0022) - \mu(0202), \\
t_5 &= \mu(0133) + \mu(0122) + \mu(0032) + \mu(0023) - \mu(0313) - \mu(0302) - \mu(0212) - \mu(0203), \\
t_6 &= \mu(0033) - \mu(0303), \\
t_7 &= \mu(0123) - \mu(0213), \\
t_8 &= \mu(0132) - \mu(0312), \\
t_9 &= \mu(0231) - \mu(0321).
\end{aligned}$$

We first present a non-constructive proof of non-identifiability by proving that there
does not exist a linear test, and then the duality theorem implies there exists a non-identifiable
mixture. We then prove the stronger result where the rate matrix is fixed.

Lemma 20. There does not exist a linear test for the K3 model.

Corollary 21. There exists a non-identifiable mixture in the K3 model.

Proof of Lemma 20. Suppose that $t = \alpha_1 t_1 + \dots + \alpha_9 t_9$ is a test. Let $w_i = (x_i, y_i, z_i)$,
$1 \le i \le 5$. For $1 \le i \le 4$, $w_i$ denotes the parameters for the edge incident to leaf
$i$, and $w_5$ are the parameters for the internal edge. In this proof the entries of $w_i$ are read as
the three off-diagonal transition probabilities of the edge (that is, 1/4 minus the centered
parameters), so that $(0,0,0)$ corresponds to the identity matrix.

In the definitions of $\mu_1$ and $\mu_2$ below we will set $w_2 = w_4 = w_5 = (0, 0, 0)$. This ensures
that in labelings with non-zero probability, leaves $v_2, v_4$ and both internal vertices all have
the same label. Moreover, by Observation 12, $t^T\mu_i = 0$.

Let $\mu_1$ be generated from $T_2$ with $w_1 = (1/4, y_1, z_1)$ and $w_3 = (1/4, 0, 0)$. In labelings
with non-zero probability, the labels of $v_2$ and $v_3$ have to both be in $\{0, 1\}$ or both in $\{2, 3\}$.
The only labels in $t_1, \dots, t_9$ with this property are 0101, 0232, 0323. Thus,

$$t^T\mu_1 = -\alpha_1\mu_1(0101) - \alpha_2\mu_1(0232) - \alpha_3\mu_1(0323) = -\tfrac{1}{16}\left(\alpha_1/4 + \alpha_2 y_1 + \alpha_3 z_1\right) = 0. \tag{20}$$

Any $y_1, z_1 \in [1/8 - \varepsilon, 1/8 + \varepsilon]$ with $y_1 \ge z_1$ gives a valid matrix in the K3 model. For (20)
to be always zero we must have $\alpha_1 = \alpha_2 = \alpha_3 = 0$ (since $1, y_1, z_1$ are linearly independent
polynomials).

Let $\mu_2$ be generated from $T_2$ with $w_1 = (1/4, y_1, z_1)$, and $w_3 = (1/4, y_1, z_1)$. The only
labels in $t_4, \dots, t_9$ with $v_2$ and $v_4$ having the same label are

0202, 0313, 0212, 0303

(we ignore the labels in $t_1, t_2, t_3$ because $\alpha_1 = \alpha_2 = \alpha_3 = 0$). Thus

$$t^T\mu_2 = -\alpha_4\mu_2(0202) - \alpha_5\mu_2(0313) - \alpha_5\mu_2(0212) - \alpha_6\mu_2(0303) = -\tfrac{1}{4}\left(\alpha_4 y_1^2 + 2\alpha_5 y_1 z_1 + \alpha_6 z_1^2\right) = 0. \tag{21}$$

Polynomials $y_1^2$, $y_1 z_1$, $z_1^2$ are linearly independent and hence $\alpha_4 = \alpha_5 = \alpha_6 = 0$.

Thus $t$ is a linear combination of $t_7, t_8, t_9$ and hence has at most 6 terms. A test for K3
must be a test for K2, but the unique test for K2 has 8 terms. Thus $t$ cannot be a test for
K3. □



7.2 Non-identifiable mixtures in the K3 model with a fixed rate matrix 



We will now consider rate matrices for the model, as opposed to the transition probability
matrices. The rate matrix for the K3 model is

$$R = R(\alpha,\beta,\gamma) = \begin{pmatrix} -\alpha-\beta-\gamma & \alpha & \beta & \gamma \\ \alpha & -\alpha-\beta-\gamma & \gamma & \beta \\ \beta & \gamma & -\alpha-\beta-\gamma & \alpha \\ \gamma & \beta & \alpha & -\alpha-\beta-\gamma \end{pmatrix}, \tag{22}$$

where the rates usually satisfy $\alpha \ge \beta \ge \gamma > 0$. For our examples we will also assume that
$\alpha, \beta, \gamma \in [0, 1]$. Since the examples are negative they work immediately for the above weaker
constraint.

Recall, the K2 model is a submodel of the K3 model: the rate matrices of the K2 model
are precisely the rate matrices of the K3 model with $\beta = \gamma$. By Lemma 15 there exists a test
in the K2 model and hence there are no non-identifiable mixtures. We will show that the
existence of a test in the K2 model is a rather singular event: for almost all rate matrices
in the K3 model there exist non-identifiable mixtures and hence no test.

We show the following result.

Lemma 22. Let $\alpha, \beta, \gamma$ be chosen independently from the uniform distribution on $[0,1]$.
With probability 1 (over the choice of $\alpha, \beta, \gamma$) there does not exist a test for the K3 model
with the rate matrix $R(\alpha, \beta, \gamma)$.

To prove Lemma 22 we need the following technical concept. A generalized polynomial
is a function of the form

$$\sum_{i=1}^{m} a_i(u_1, \dots, u_n)\, e^{b_i(u_1, \dots, u_n)}, \tag{23}$$

where the $a_i$ are non-zero polynomials and the $b_i$ are distinct linear polynomials. Note
that the set of generalized polynomials is closed under addition, multiplication, and taking
derivatives. Thus, for example, the Wronskian of a set of generalized polynomials (with
respect to one of the variables, say $u_1$) is a generalized polynomial.

For $n = 1$ we have the following bound on the number of roots of a generalized polynomial
(see [20], part V, problem 75):

Lemma 23. Let $G = \sum_{i=1}^{m} a_i(u) e^{b_i(u)}$ be a univariate generalized polynomial. Assume
$m > 0$. Then $G$ has at most

$$(m - 1) + \sum_{i=1}^{m} \deg a_i(u)$$

real roots, where $\deg a_i(u)$ is the degree of the polynomial $a_i(u)$.

Corollary 24. Let $G(u_1, \dots, u_n) = \sum_{i=1}^{m} a_i(u_1, \dots, u_n) e^{b_i(u_1, \dots, u_n)}$ be a generalized polynomial.
Assume $m > 0$ (i.e., that $G$ is not the zero polynomial). Let $r_1, \dots, r_n$ be picked
independently and randomly from the uniform distribution on $[0, 1]$. Then with probability 1
(over the choice of $r_1, \dots, r_n$) we have $G(r_1, \dots, r_n) \ne 0$.

Proof of Corollary 24. We will proceed by induction on $n$. The base case $n = 1$ follows
from Lemma 23 since the probability of a finite set is zero.

For the induction step consider the polynomial $a_1$. Since $a_1$ is a non-zero polynomial
there are only finitely many choices of $c \in [0,1]$ for which $a_1(c, u_2, \dots, u_n)$ is the zero
polynomial. Thus with probability 1 over the choice of $u_1 = c$ we have that $a_1(c, u_2, \dots, u_n)$
is a non-zero polynomial. Now, by the induction hypothesis, with probability 1 over the
choice of $u_2, \dots, u_n$ we have that $G(c, u_2, \dots, u_n) \ne 0$. Hence with probability 1 over the
choice of $u_1, \dots, u_n$ we have that $G(u_1, \dots, u_n) \ne 0$. □



Proof of Lemma 22. Any test has to be a linear combination of $t_1, \dots, t_9$ defined in
Section 7.1. Suppose that $t = \sigma_1 t_1 + \dots + \sigma_9 t_9$ is a test. We will use Observation 12 to show that
$\sigma_1 = \dots = \sigma_9 = 0$ and hence $t$ cannot be a test. Let $\sigma = (\sigma_1, \dots, \sigma_9)$.

The transition matrix of the process with rate matrix $R$ at time $s$ is $T(s) = \exp(sR)$.
We have

$$T(s) = M_{K3}(A,B,C) = \frac{1}{4}E + \begin{pmatrix} A+B+C & -A & -B & -C \\ -A & A+B+C & -C & -B \\ -B & -C & A+B+C & -A \\ -C & -B & -A & A+B+C \end{pmatrix}, \tag{24}$$

where

$$\begin{aligned}
A &= \left(e^{-2s(\alpha+\beta)} + e^{-2s(\alpha+\gamma)} - e^{-2s(\beta+\gamma)}\right)/4, \\
B &= \left(e^{-2s(\alpha+\beta)} - e^{-2s(\alpha+\gamma)} + e^{-2s(\beta+\gamma)}\right)/4, \\
C &= \left(-e^{-2s(\alpha+\beta)} + e^{-2s(\alpha+\gamma)} + e^{-2s(\beta+\gamma)}\right)/4,
\end{aligned}$$

so that $A = B = C = 1/4$ at $s = 0$ and $T(0) = I$.

We have changed notation from our earlier definition of $M_{K3}$ by exchanging $A, B, C$ for
$x, y, z$ to indicate these as functions of $\alpha$, $\beta$ and $\gamma$.
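
The closed form for $T(s)$ can be sanity-checked numerically: it should equal the identity at $s = 0$, satisfy the semigroup property $T(s)T(t) = T(s+t)$, and have derivative $R(\alpha,\beta,\gamma)$ at $s = 0$. The sketch below assumes the centered form of $A$, $B$, $C$ (equal to 1/4 at $s = 0$); the function names and the rate values are illustrative only.

```python
import math

def k3_transition(s, alpha, beta, gamma):
    """T(s) = (1/4)E + centered K3 matrix with parameters A, B, C."""
    e_ab = math.exp(-2 * s * (alpha + beta))
    e_ag = math.exp(-2 * s * (alpha + gamma))
    e_bg = math.exp(-2 * s * (beta + gamma))
    A = (e_ab + e_ag - e_bg) / 4
    B = (e_ab - e_ag + e_bg) / 4
    C = (-e_ab + e_ag + e_bg) / 4
    d = 0.25 + A + B + C
    a, b, c = 0.25 - A, 0.25 - B, 0.25 - C
    return [[d, a, b, c], [a, d, c, b], [b, c, d, a], [c, b, a, d]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def max_abs_diff(p, q):
    return max(abs(p[i][j] - q[i][j]) for i in range(4) for j in range(4))

alpha, beta, gamma = 0.9, 0.5, 0.2          # arbitrary rates with alpha >= beta >= gamma > 0
identity = [[float(i == j) for j in range(4)] for i in range(4)]

err_identity = max_abs_diff(k3_transition(0.0, alpha, beta, gamma), identity)
err_semigroup = max_abs_diff(
    matmul(k3_transition(0.3, alpha, beta, gamma), k3_transition(0.7, alpha, beta, gamma)),
    k3_transition(1.0, alpha, beta, gamma),
)

# Finite-difference estimate of the first row of the rate matrix R.
h = 1e-6
t_h = k3_transition(h, alpha, beta, gamma)
rate_row = [(t_h[0][j] - identity[0][j]) / h for j in range(4)]
```

The estimated first row of the rate matrix should be close to $(-\alpha-\beta-\gamma, \alpha, \beta, \gamma)$, in agreement with (22).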

Let $\mu_x$ be generated from $T_3$ with edge weights $T(x), T(2x), T(3x), T(4x), T(0)$. The
internal edge has length 0 (i.e., it is labeled by the identity matrix), and hence, by
Observation 12, we must have $t^T\mu_x = 0$ for all $x \ge 0$.

Let $f_i(x) = t_i^T\mu_x$ for $i = 1, \dots, 9$. Let $W(x)$ be the Wronskian matrix of $f_1(x), \dots, f_9(x)$
with respect to $x$. The entries of $W(x)$ are generalized polynomials in variables $\alpha, \beta, \gamma$, and
$x$. Thus the Wronskian (i.e., $\det W(x)$) is a generalized polynomial in variables $\alpha, \beta, \gamma$, and
$x$.

Now we show that for a particular choice of $\alpha, \beta, \gamma$ and $x$ we have $\det W(x) \ne 0$. Let
$\alpha = \pi i$, $\beta = 2\pi i$, $\gamma = -2\pi i$, and $x = 1$. Of course complex rates are not valid in the
K3 model; we use them only to establish that $\det W(x)$ is a non-zero generalized polynomial.
Tedious computation (best performed using a computer algebra system) yields that the
Wronskian matrix is the following. The first 7 columns are



8 

40177 

-1052tt 2 
-12244i7r 3 
206468-ir 4 
3199060i it 5 
-46789172ir 6 
826613044i7r 7 
V 11742908228ir 8 



-16 
-240i7r 
3736-ir 2 
52920i-n- 3 
-923656-7T 4 
-12454200! 7T 5 
230396776tt 6 
3116686680i7r 7 
-583718170967T 8 



-16 

80I7T 

856tt 2 
-4520i7r 3 
-1345367T 4 
-280600i7T 5 
254473367r 6 
184720120i7r 7 
-50473844567T 8 



56l7T 

-780-tt 2 
-13988i7r 3 
187188ir 4 
3574316iir 5 
-478073407T 6 
-908923988i7r 7 
12445876068ir 8 



„2 

TT 3 
4 



16 
80i 
-824tt 
-2600i?i 
46856-ir' 
103400; 7r 5 
-27409047T 6 
-4488200i-n- 7 
1637594967T 8 



24i7r 

-3007T 2 

732i7T 3 
9108ir 4 
-354276i?r 5 
1992900tt 6 
82589532ijT 7 
-7766752927T 8 





48I7T 
1127T 2 

-4224i7r 3 

1520vr 4 
576168iir 5 
-27818487T 6 
-98067024i?r 7 
840993440π^8

The last 2 columns are



o 

— 48i7r 



— 96in 
800tt 2 
17568ivr 3 



912jt 2 
14880iir 3 



-2124007T 4 
-3692808i7r 5 
49984152tt 6 



-1985607T 4 
-4100016i7r 5 
5119160Cbr 6 
1003352208i7T 7 



919913280i7r 7 
-12613821600tt ; 



-1332353312Cbr 8 / 



The determinant of $W(x)$ is $33920150890618370745095852723798016000000\,\pi^{36}$ which is
non-zero. Thus $\det W(x)$ is a non-zero generalized polynomial and hence for random $\alpha, \beta, \gamma$
and $x$ we have that $\det W(x)$ is non-zero with probability 1.

Assume now that $\det W(x) \ne 0$. Let $w = W(x)\sigma$. The first entry $w_1$ of $w$ is given
by $w_1 = t^T\mu_x$. Since $t$ is a test we have $w_1 = 0$. The second entry $w_2$ of $w$ is given
by $w_2 = (d/dx)\,t^T\mu_x$. Again we must have $w_2 = 0$, since $t$ is a test. Similarly we show
$w_3 = \dots = w_9 = 0$. Note that $W(x)$ is a regular matrix and hence $\sigma$ must be the zero
vector. Thus $t$ is not a test. Since $\det W(x) \ne 0$ happens with probability 1 we have that
there exists no test with probability 1. □

8 Maximum Likelihood Results 

Here we prove Theorem 2.

8.1 Terminology

In the CFN and JC model there is a single parameter defining the transition matrix for
each edge, namely a parameter $x_e > 0$ (with a model-dependent upper bound, recalled below). In K2, there are 2 parameters
and in K3 there are 3 parameters. We use the term weights $\vec{w}_e$ to denote the setting of the
parameters defining the transition matrix for the edge $e$. And then we let $\vec{w} = (\vec{w}_e)_{e \in T}$
denote the set of vectors defining the transition matrices for edges of tree $T$.

We use the term zero weight edge to denote the setting of the weights so that the transition
matrix is the identity matrix $I$. Thus, in the CFN and JC models, this corresponds
to setting $x_e = 0$. We refer to a non-zero weight edge as a setting of the parameters so that
all entries of the transition matrix are positive. Hence, in the CFN and JC models this
corresponds to $x_e > 0$.

Note, $\mathcal{L}_T(\mu)$ is maximized over the set of weights $\vec{w}$. Hence our consideration of weights
$\vec{w}$. There are typically constraints on $\vec{w}$ in order to define valid transition matrices. For
example, in the CFN model we have $\vec{w}_e = x_e$ where $0 < x_e < 1/2$, and similarly $0 < x_e < 1/4$
in the JC model. In the K2 model we have $\vec{w}_e = (x, y)$ where $x \ge y > 0$, $x + y < 1/2$.
Finally in the K3 model we have further constraints as detailed in Section 7.

8.2 Technical Tools



Our proof starts from the following observation. Consider the CFN model. Note
that for $x = 0$ we have $P_1 = P_2$ and hence $\mu_0 = \mu_{T_3,\vec{v}}$ where $\vec{v} = (1/2, 1/2, 1/2, 1/2, 0)$
are the weights for the CFN model, i.e., $\mu_0$ is generated by a pure distribution from $T_3$. In
fact it can be generated by a pure distribution from any leaf-labeled tree on 4 leaves since
the internal edges have zero length.

Observation 25. Consider a tree $T$ on $n$ leaves and weights $\vec{w}$ where all internal edges
have zero weight. For all trees $S \ne T$ on $n$ leaves, there is a unique weight $\vec{v}$ such that
$\mu_{S,\vec{v}} = \mu_{T,\vec{w}}$.

Proof. Let $\vec{v}$ have the same weight as $\vec{w}$ for all terminal edges, and zero weight for all
internal edges. Note, we then have $\mu_{S,\vec{v}} = \mu_{T,\vec{w}}$ and it remains to prove uniqueness of $\vec{v}$.
Let $\vec{u}$ be a set of weights where $\mu_{S,\vec{u}} = \mu_{S,\vec{v}}$. Let $S'$ be obtained from $S$ by contracting
all the edges of zero weight in $\vec{u}$, and let $\vec{u}'$ be the resulting set of weights for the remaining
edges. The tree $S'$ has all internal vertices of degree $\ge 3$ and internal edges have non-zero
weight.

It follows from the work of Buneman that $S', \vec{u}'$ is unique among trees with non-zero
internal edge weights and without vertices of degree two. If $\vec{u} = \vec{v}$, then $S'$ is a star, and
hence every $\vec{u}$ must contract to a star. This is only possible if $\vec{u}$ assigns zero weight to all
internal edges. Therefore, $\vec{v}$ is the unique weight. □

For $\vec{w}$ and $\vec{w}'$ defined for the CFN model in Theorem 2, for any 4-leaf tree $S$, the
maximum of $\mathcal{L}_S(\mu_0)$ is achieved on $\vec{v} = (1/4, 1/4, 1/4, 1/4, 0)$. For any $\vec{v}' \ne \vec{v}$ the distribution
$\mu_{S,\vec{v}'}$ is different from $\mu_0$ and hence $\mathcal{L}_{S,\vec{v}'}(\mu_0) < \mathcal{L}_{S,\vec{v}}(\mu_0) = \mu_0^T \ln \mu_0$. Intuitively, for small $x$
the maximum of $\mathcal{L}_S(\mu_x)$ should be realized on a $\vec{v}''$ which is near $\vec{v}$. Now we formalize this
argument in the following lemma. Then in Lemma 28 we will use the Hessian and Jacobian
of the expected log-likelihood functions to bound $\mathcal{L}_S(\mu_x)$ in terms of $\mathcal{L}_S(\mu_0)$.

Lemma 26. Let $\mu$ be a probability distribution on $\Omega^n$ such that every element has non-zero
probability. Let $S$ be a leaf-labeled binary tree on $n$ leaves. Suppose that there exists a unique
$\vec{v}$ in the closure of the model such that $\mu_{S,\vec{v}} = \mu$. Then for every $\delta > 0$ there exists $\varepsilon > 0$
such that for any $\mu'$ with $\|\mu' - \mu\| \le \varepsilon$ the global optima of $\mathcal{L}_S(\mu')$ are attained on $\vec{v}'$ for
which $\|\vec{v}' - \vec{v}\| \le \delta$.

Remark 27. In our application of the above lemma we have $\mu = \mu_0$. Consider a tree $S$ and
its unique weight $\vec{v}$ where $\mu_{S,\vec{v}} = \mu_0$. Note, the requirement that every element in $\{0,1\}^n$ has
non-zero probability is satisfied if the terminal edges have non-zero weights. In contrast,
the internal edges have zero weight so that Observation 25 applies, and in some sense the
tree is achievable on every topology.






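Proof. We will prove the lemma by contradiction. Roughly speaking, we now suppose that
there exists $\mu'$ close to $\mu$ where the maximum of $\mu'$ is achieved far from the maximum of $\mu$.
Formally, suppose that there exists $\delta > 0$ and sequences $\mu'_i$ and $\vec{v}'_i$ such that $\lim_{i \to \infty} \|\mu'_i - \mu\| = 0$
and $\|\vec{v}'_i - \vec{v}\| > \delta$ where $\vec{v}'_i$ is a weight for which the optimum of $\mathcal{L}_S(\mu'_i)$ is attained.
By the optimality of the $\vec{v}'_i$ we have

$$\mathcal{L}_{S,\vec{v}'_i}(\mu'_i) \ge \mathcal{L}_{S,\vec{v}}(\mu'_i). \tag{25}$$

We assumed $\mu_{S,\vec{v}} = \mu$ and hence the entries of $\ln \mu_{S,\vec{v}}$ are finite since we assumed $\mu$ has
positive entries. Thus

$$\lim_{i \to \infty} \mathcal{L}_{S,\vec{v}}(\mu'_i) = \lim_{i \to \infty} \left((\mu'_i)^T \ln \mu_{S,\vec{v}}\right) = \mu^T \ln \mu_{S,\vec{v}} = \mathcal{L}_{S,\vec{v}}(\mu). \tag{26}$$

Take a subsequence of the $\vec{v}'_i$ which converges to some $\vec{v}'$. Note, $\vec{v}'$ is in the closure of the
model.

Let $\varepsilon$ be the smallest entry of $\mu$. For all sufficiently large $i$, $\mu'_i$ has all entries $\ge \varepsilon/2$.
Hence,

$$\mathcal{L}_{S,\vec{v}'_i}(\mu'_i) = \sum_{z \in \Omega^n} \mu'_i(z) \ln \mu_{S,\vec{v}'_i}(z) \le \frac{\varepsilon}{2} \min_{z \in \Omega^n} \ln \mu_{S,\vec{v}'_i}(z). \tag{27}$$

Because of (26), for all sufficiently large $i$, $\mathcal{L}_{S,\vec{v}}(\mu'_i) \ge 2\mathcal{L}_{S,\vec{v}}(\mu)$ (recall that the log-likelihoods
are negative). Combining with (25) and (27), we have that the entries of $\ln \mu_{S,\vec{v}'_i}$ are
bounded from below by $4\mathcal{L}_{S,\vec{v}}(\mu)/\varepsilon$. Thus, both $\mu'_i$ and $\ln \mu_{S,\vec{v}'_i}$ are bounded. For bounded
and convergent sequences,

$$\lim_{n \to \infty} a_n b_n = \lim_{n \to \infty} a_n \lim_{n \to \infty} b_n.$$

Therefore,

$$\lim_{i \to \infty} \mathcal{L}_{S,\vec{v}'_i}(\mu'_i) = \left(\lim_{i \to \infty} \mu'_i\right)^T \left(\lim_{i \to \infty} \ln \mu_{S,\vec{v}'_i}\right) = \mu^T \ln \mu_{S,\vec{v}'} = \mathcal{L}_{S,\vec{v}'}(\mu). \tag{28}$$

From (25), (26), and (28) we have $\mathcal{L}_{S,\vec{v}'}(\mu) \ge \mathcal{L}_{S,\vec{v}}(\mu)$. Since $\vec{v}' \ne \vec{v}$ we get a contradiction
with the uniqueness of $\vec{v}$. □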

We now bound the difference of $\mathcal{L}_S(\mu')$ and $\mathcal{L}_S(\mu)$ when the previous lemma applies.
This will then imply that for $x$ sufficiently small, $\mathcal{L}_S(\mu_x)$ is close to $\mathcal{L}_S(\mu_0)$.
Here is the formal statement of the lemma.

Lemma 28. Let $\mu \in \mathbb{R}^{|\Omega|^n}$ be a probability distribution on $\Omega^n$ such that every element has
non-zero probability. Let $S$ be a leaf-labeled binary tree on $n$ leaves. Suppose that there
exists $\vec{v}$ in the closure of the model such that $\mu_{S,\vec{v}} = \mu$ and that $\vec{v}$ is the unique such
weight. Let $\Delta\mu_x \in \mathbb{R}^{|\Omega|^n}$ be such that $\Delta\mu_x^T \mathbf{1} = 0$, and $x \mapsto \Delta\mu_x$ is continuous around 0 in the
following sense: $\Delta\mu_x \to \Delta\mu_0$ as $x \to 0$.

Let $g(\vec{w}) = \mathcal{L}_{S,\vec{w}}(\mu)$, and $h_x(\vec{w}) = (\Delta\mu_x)^T \ln \mu_{S,\vec{w}}$. Let $H$ be the Hessian of $g$ at $\vec{v}$
and $J_x$ be the Jacobian of $h_x$ at $\vec{v}$. Assume that $H$ has full rank. Then

$$\mathcal{L}_S(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x\,h_x(\vec{v}) - \frac{x^2}{2} J_0 H^{-1} J_0^T + o(x^2). \tag{29}$$

Moreover, if $(H^{-1}J^T)_i < 0$ for all $i$ such that $v_i = 0$ then the inequality in (29) can be
replaced by equality.

Remark 29. When $(H^{-1}J^T)_i < 0$ for all $i$ such that $v_i = 0$ then the likelihood is maximized
at non-trivial branch lengths. In particular, for the CFN model, the branch lengths are in
the interval $(0, 1/2)$, that is, there are no branches of length 0 or 1/2. Similarly for the
Jukes-Cantor model the lengths are in $(0, 1/4)$.

Proof. For notational convenience, let $f(\vec{w}) = \ln \mu_{S,\vec{w}}$. Thus, $g(\vec{w}) = \mu^T f(\vec{w})$. Note that

$$f(\vec{v}) = \ln \mu. \qquad (30)$$

The function $f$ maps assignments of weights for the $2n - 3$ edges of $S$ to the logarithm of the distribution induced on the leaves. Hence, the domain of $f$ is the closure of the model, which is a subset of $\mathbb{R}^{d(2n-3)}$, where $d$ is the dimension of the parameterization of the model, e.g., in the CFN and JC models $d = 1$ and in the K2 model $d = 2$. We denote the closure of the model as $\Lambda$. Note, the range of $f$ is $[-\infty, 0]^{|\Omega|^n}$.

Let $J_f = (\partial f_i/\partial w_j)$ be the Jacobian of $f$ and $H_f = (\partial^2 f_i/\partial w_j \partial w_k)$ be the Hessian of $f$ ($H_f$ is a rank 3 tensor).

If $\vec{v}$ is in the interior of $\Lambda$, then since $\vec{v}$ optimizes $\mathcal{L}_{S,\vec{w}}(\mu)$ we have

$$\mu^T J_f(\vec{v}) = 0. \qquad (31)$$

We will now argue that equality (31) remains true even when $\vec{v}$ is on the boundary of $\Lambda$. The function $\mu_{S,\vec{w}}$ is a polynomial function of the coordinates of $\vec{w}$. The coordinates of $\mu_{S,\vec{w}}$ sum to the constant 1 function. We assumed that $\mu = \mu_{S,\vec{v}}$ is strictly positive and hence for all $\vec{w}$ in a small neighborhood of $\vec{v}$ we have that $\mu_{S,\vec{w}}$ is still a distribution (note that $\vec{w}$ can be outside of $\Lambda$). Suppose that $\mu^T J_f(\vec{v}) \neq 0$. Then there exists $\Delta\vec{v}$ such that $\mu^T \ln \mu_{S,\vec{v}+\Delta\vec{v}} > \mu^T \ln \mu_{S,\vec{v}} = \mu^T \ln \mu$. This contradicts the fact that the maximum (over all distributions $\nu$) of $\mu^T \ln \nu$ is achieved at $\nu = \mu$. Thus (31) holds.
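If we perturb $\mu$ by $x\Delta\mu_x$ and $\vec{v}$ by $\Delta\vec{v}$ the likelihood changes as follows:

$$\begin{aligned}
\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x)
&= (\mu + x\Delta\mu_x)^T f(\vec{v} + \Delta\vec{v}) \\
&= (\mu + x\Delta\mu_x)^T \left( f(\vec{v}) + J_f(\vec{v})(\Delta\vec{v}) + \tfrac{1}{2}(\Delta\vec{v})^T H_f(\vec{v})(\Delta\vec{v}) + O(\|\Delta\vec{v}\|^3) \right) \\
&= \mu^T \ln \mu + \mu^T \left( J_f(\vec{v})(\Delta\vec{v}) + \tfrac{1}{2}(\Delta\vec{v})^T H_f(\vec{v})(\Delta\vec{v}) \right) \\
&\qquad + x(\Delta\mu_x)^T \left( f(\vec{v}) + J_f(\vec{v})(\Delta\vec{v}) \right) + O\left(\|\Delta\vec{v}\|^3 + x\|\Delta\vec{v}\|^2\right) \\
&= \mu^T \ln \mu + \mu^T \left( \tfrac{1}{2}(\Delta\vec{v})^T H_f(\vec{v})(\Delta\vec{v}) \right) + x(\Delta\mu_x)^T \left( f(\vec{v}) + J_f(\vec{v})(\Delta\vec{v}) \right) \\
&\qquad + O\left(\|\Delta\vec{v}\|^3 + x\|\Delta\vec{v}\|^2\right), \qquad (32)
\end{aligned}$$

where in the third step we used (30) and in the last step we used (31). In terms of $H$, $J_x$ defined in the statement of the lemma we have

$$\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x) = \mu^T \ln \mu + x h_x(\vec{v}) + \tfrac{1}{2}(\Delta\vec{v})^T H (\Delta\vec{v}) + x J_x (\Delta\vec{v}) + O\left(\|\Delta\vec{v}\|^3 + x\|\Delta\vec{v}\|^2\right). \qquad (33)$$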

We will now prove that $H$ is negative definite. First note that we assumed that $H$ is full-rank, thus all of its eigenvalues are non-zero. Moreover, $g(\vec{w}) = \mathcal{L}_{S,\vec{w}}(\mu)$ is maximized for $\vec{w} = \vec{v}$. Plugging $x = 0$ into (33), we obtain

$$g(\vec{v} + \Delta\vec{v}) = \mu^T \ln \mu + \tfrac{1}{2}(\Delta\vec{v})^T H (\Delta\vec{v}) + O\left(\|\Delta\vec{v}\|^3\right). \qquad (34)$$

Hence, if $H$ were not negative definite, then it would have at least one positive eigenvalue. Let $\vec{z}$ denote the corresponding eigenvector. Let $\Delta\vec{v}$ be a sufficiently small multiple of $\vec{z}$. By (34) we have $g(\vec{v} + \Delta\vec{v}) > g(\vec{v})$, which contradicts the earlier claim that $\vec{w} = \vec{v}$ maximizes $g(\vec{w})$. Hence, all of the eigenvalues of $H$ are negative, i.e., it is negative definite. Let $\lambda_{\max}$ be the largest eigenvalue of $H$, i.e., the one closest to zero. We will use that $\lambda_{\max} < 0$. Note,

$$(\Delta\vec{v})^T H (\Delta\vec{v}) \le \lambda_{\max} \|\Delta\vec{v}\|^2. \qquad (35)$$

Set $\Delta\vec{v}$ so that $\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x)$ is maximized. ($\Delta\vec{v}$ is a function of $x$.) Now we will prove that $\|\Delta\vec{v}\| = O(x)$. For any $\delta > 0$, by Lemma 27, for all sufficiently small $x$, we have $\|\Delta\vec{v}\| \le \delta$. By (33) and (35) we have

$$\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x h_x(\vec{v}) + \tfrac{1}{2}\lambda_{\max} \|\Delta\vec{v}\|^2 + x J_x (\Delta\vec{v}) + O\left(\|\Delta\vec{v}\|^3 + x\|\Delta\vec{v}\|^2\right).$$
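Assume $x = o(\|\Delta\vec{v}\|)$ (i.e., assume that $\|\Delta\vec{v}\|$ goes to zero more slowly than $x$). Then, we have

$$\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x h_x(\vec{v}) + \tfrac{1}{2}\lambda_{\max} \|\Delta\vec{v}\|^2 + o(\|\Delta\vec{v}\|^2).$$

On the other hand, we have

$$\mathcal{L}_{S,\vec{v}}(\mu + x\Delta\mu_x) = (\mu + x\Delta\mu_x)^T \ln \mu_{S,\vec{v}} = \mu^T \ln \mu + x h_x(\vec{v}).$$

Since $\lambda_{\max} \|\Delta\vec{v}\|^2$ is negative, for sufficiently small $\|\Delta\vec{v}\|$ (recall we can choose any $\delta > 0$ where $\|\Delta\vec{v}\| \le \delta$), we have

$$\mathcal{L}_{S,\vec{v}+\Delta\vec{v}}(\mu + x\Delta\mu_x) < \mathcal{L}_{S,\vec{v}}(\mu + x\Delta\mu_x),$$

which contradicts the choice of $\Delta\vec{v}$ as a maximizer. Thus we may restrict ourselves to $\Delta\vec{v}$ such that $\|\Delta\vec{v}\| = O(x)$. Hence

$$\mathcal{L}_S(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x h_x(\vec{v}) + \max_{\Delta\vec{w}} \left( \tfrac{1}{2}(\Delta\vec{w})^T H (\Delta\vec{w}) + x J_x (\Delta\vec{w}) \right) + O(x^3). \qquad (36)$$

The maximum of

$$\tfrac{1}{2}(\Delta\vec{w})^T H (\Delta\vec{w}) + x J_x (\Delta\vec{w}) \qquad (37)$$

occurs at $\Delta\vec{z} := -x H^{-1} J_x^T$; for this $\Delta\vec{z}$ the value of (37) is $-\frac{x^2}{2} J_x H^{-1} J_x^T$. Therefore,

$$\mathcal{L}_S(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x h_x(\vec{v}) - \frac{x^2}{2} J_x H^{-1} J_x^T + O(x^3).$$

From $\Delta\mu_x \to \Delta\mu_0$, we have $J_x H^{-1} J_x^T = (1 + o(1)) J_0 H^{-1} J_0^T$, and hence

$$\mathcal{L}_S(\mu + x\Delta\mu_x) \le \mu^T \ln \mu + x h_x(\vec{v}) - \frac{x^2}{2} J_0 H^{-1} J_0^T + o(x^2).$$

This completes the proof of the first part of the lemma. It remains to prove the case when the inequality can be replaced by equality.

Note that in general the inequality cannot be replaced by equality, since $\vec{v} + \Delta\vec{z}$ can be an invalid weight (i.e., outside of $\Lambda$) for all $x$. If $(\Delta\vec{z})_i \ge 0$ whenever $v_i = 0$ then $\vec{v} + \Delta\vec{z}$ is a valid weight vector for sufficiently small $x$ and hence plugging directly into (33) we have

$$\mathcal{L}_S(\mu + x\Delta\mu_x) \ge \mu^T \ln \mu + x h_x(\vec{v}) - \frac{x^2}{2} J_x H^{-1} J_x^T + O(x^3) = \mu^T \ln \mu + x h_x(\vec{v}) - \frac{x^2}{2} J_0 H^{-1} J_0^T + o(x^2). \qquad (38)$$

□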
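The value of the quadratic (37) used in the proof can be checked in two lines. The following is a short worked derivation, assuming only that $H$ is symmetric and negative definite (as established in the proof); $q$ is a name introduced here for the expression in (37).

```latex
% q(\Delta w) := \tfrac12 \Delta w^T H \Delta w + x J_x \Delta w, with H symmetric, negative definite.
\begin{align*}
\nabla q(\Delta w) &= H\,\Delta w + x J_x^T = 0
  \quad\Longrightarrow\quad \Delta z = -x H^{-1} J_x^T, \\
q(\Delta z) &= \tfrac12 x^2 J_x H^{-1} H H^{-1} J_x^T - x^2 J_x H^{-1} J_x^T
  = -\tfrac{x^2}{2} J_x H^{-1} J_x^T .
\end{align*}
% Since H is negative definite, q is strictly concave, so this stationary point is the global maximum,
% and the maximal value is positive whenever J_x \neq 0.
```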
8.3 Proof of Theorem 2 in CFN: C = 1/4

In this section we deal with the CFN model. We prove Part 1 of Theorem 2. For simplicity, we first present the proof for the case $C = 1/4$.

Let $\mu_1$ be generated from $T_3$ with weights $P_1 = (1/4 + x, 1/4 - x, 1/4 - x, 1/4 + x, x^2)$ and $\mu_2$ be generated from $T_3$ with weights $P_2 = (1/4 - x, 1/4 + x, 1/4 + x, 1/4 - x, x^2)$. Let $\mu = (\mu_1 + \mu_2)/2$. Note that $(1\,4)(2\,3)$ fixes $\mu_1$ and $\mu_2$, and $(1\,2)(3\,4)$ swaps $\mu_1$ and $\mu_2$. Hence $\mu$ is invariant under $K = \langle (1\,2)(3\,4), (1\,4)(2\,3) \rangle$. This simplifies many of the following calculations.
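The symmetry claim can be checked numerically. Below is a minimal Python sketch, assuming that $T_3$ is the quartet tree with cherries $\{1,4\}$ and $\{2,3\}$ (an assumption consistent with the stated symmetries), that the first four weights belong to the pendant edges of leaves 1 to 4, and that the root distribution is uniform. The function names `cfn_quartet` and `permute` are illustrative only.

```python
from fractions import Fraction as F
from itertools import product

def cfn_quartet(cherries, leaf_p, internal_p):
    """CFN distribution on {0,1}^4 for a quartet tree (leaves indexed 0..3)."""
    def step(p, s, t):
        # probability of moving from state s to state t along an edge with flip probability p
        return p if s != t else 1 - p
    c1, c2 = cherries
    mu = {}
    for pattern in product((0, 1), repeat=4):
        total = F(0)
        for u, v in product((0, 1), repeat=2):  # states of the two internal vertices
            pr = F(1, 2) * step(internal_p, u, v)
            for leaf in c1:
                pr *= step(leaf_p[leaf], u, pattern[leaf])
            for leaf in c2:
                pr *= step(leaf_p[leaf], v, pattern[leaf])
            total += pr
        mu[pattern] = total
    return mu

def permute(mu, perm):
    """Relabel leaves by an involution given as a tuple (perm[i] is the image of i)."""
    return {pat: mu[tuple(pat[perm[i]] for i in range(4))] for pat in mu}

x = F(1, 100)                       # any small positive value works
T3 = ((0, 3), (1, 2))               # assumed topology 14|23, zero-based leaf indices
hi, lo = F(1, 4) + x, F(1, 4) - x
mu1 = cfn_quartet(T3, [hi, lo, lo, hi], x * x)
mu2 = cfn_quartet(T3, [lo, hi, hi, lo], x * x)
mu = {pat: (mu1[pat] + mu2[pat]) / 2 for pat in mu1}

swap_12_34 = (1, 0, 3, 2)           # the permutation (1 2)(3 4)
swap_14_23 = (3, 2, 1, 0)           # the permutation (1 4)(2 3)
print(permute(mu, swap_12_34) == mu, permute(mu, swap_14_23) == mu)
```

Exact rational arithmetic is used so that the equalities are tested exactly rather than up to rounding.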

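One can verify that the Hessian is the same for all the trees:

$$H = \begin{pmatrix}
-\frac{1552}{615} & -\frac{16}{41} & -\frac{16}{41} & -\frac{16}{41} & -\frac{80}{123} \\
-\frac{16}{41} & -\frac{1552}{615} & -\frac{16}{41} & -\frac{16}{41} & -\frac{80}{123} \\
-\frac{16}{41} & -\frac{16}{41} & -\frac{1552}{615} & -\frac{16}{41} & -\frac{80}{123} \\
-\frac{16}{41} & -\frac{16}{41} & -\frac{16}{41} & -\frac{1552}{615} & -\frac{80}{123} \\
-\frac{80}{123} & -\frac{80}{123} & -\frac{80}{123} & -\frac{80}{123} & -\frac{400}{369}
\end{pmatrix}$$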

The above is straightforward to verify in any symbolic algebra system, such as Maple. 
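The Jacobians differ only in their last coordinate. For $T_3$ we have

$$J_0 = \left( \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, -\frac{880}{369} \right).$$

Finally,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{36328}{1845} \approx 19.68997.$$

Hence, for $\vec{v} = (1/4, 1/4, 1/4, 1/4, 0)$, by Lemma 28 we have

$$\mathcal{L}_{T_3}(\mu_x) \le \mu^T \ln \mu + x h(\vec{v}) - \frac{x^2}{2} J_0 H^{-1} J_0^T + O(x^3) = \mu^T \ln \mu + x h(\vec{v}) + x^2 \, \frac{36328}{1845} + O(x^3).$$

For $T_2$ we have

$$J_0 = \left( \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, -\frac{1208}{369} \right).$$

Then,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{244862}{9225} \approx 26.54331.$$

By Lemma 28 we have

$$\mathcal{L}_{T_2}(\mu_x) \le \mu^T \ln \mu + x h(\vec{v}) + x^2 \, \frac{244862}{9225} + O(x^3).$$

For $T_1$ we have

$$J_0 = \left( \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, \frac{1744}{615}, \frac{4040}{369} \right),$$

$$-H^{-1} J_0^T = \left( -\frac{7}{4}, -\frac{7}{4}, -\frac{7}{4}, -\frac{7}{4}, \frac{143}{10} \right), \qquad (39)$$

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{126118}{1845} \approx 68.35664.$$

Note that the last coordinate of $-H^{-1} J_0^T$ (in (39)) is positive and hence we have equality in Lemma 28. Thus,

$$\mathcal{L}_{T_1}(\mu_x) = \mu^T \ln \mu + x h(\vec{v}) + x^2 \, \frac{126118}{1845} + O(x^3).$$

The largest increase in likelihood is attained on $T_1$. Thus $T_1$ is the tree with highest likelihood for sufficiently small $x$.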
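The three values of $-\frac{1}{2} J_0 H^{-1} J_0^T$ above can be re-derived with exact rational arithmetic. The following is a minimal Python sketch, not the Maple computation the text refers to. The helper names `solve` and `quadratic_gain` are illustrative, and the last Jacobian coordinate used for $T_3$ ($-880/369$) is an assumption obtained by evaluating the general-$C$ formula of Section 8.4 at $C = 1/4$.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A y = b exactly by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

# Hessian at v = (1/4, 1/4, 1/4, 1/4, 0) for C = 1/4, as stated in the text.
d, o, c, e = F(-1552, 615), F(-16, 41), F(-80, 123), F(-400, 369)
H = [[d if i == j else o for j in range(4)] + [c] for i in range(4)]
H.append([c, c, c, c, e])

a = F(1744, 615)  # common value of the first four Jacobian coordinates
last_coord = {"T1": F(4040, 369), "T2": F(-1208, 369), "T3": F(-880, 369)}

def quadratic_gain(J):
    """Return (-1/2 J H^{-1} J^T, -H^{-1} J^T) for a Jacobian row vector J."""
    y = solve(H, J)
    return -sum(j * v for j, v in zip(J, y)) / 2, [-v for v in y]

results = {name: quadratic_gain([a, a, a, a, b]) for name, b in last_coord.items()}
for name, (gain, step) in results.items():
    # A positive last coordinate of -H^{-1} J^T means Lemma 28 holds with equality.
    print(name, gain, float(gain), "last coordinate of step:", step[-1])
```

Only for $T_1$ is the last coordinate of the step positive, which is why equality is claimed for $T_1$ alone.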



8.4 Proof of Theorem 2 in CFN: Arbitrary C

The Hessian is the same for all four-leaf trees. In this case we state the inverse of the Hessian, which is simpler to state than the Hessian itself. We have

$$H^{-1} = \begin{pmatrix}
\frac{16C^4-1}{32C^2} & 0 & 0 & 0 & \frac{(4C^2-1)^2}{128C^3} \\
0 & \frac{16C^4-1}{32C^2} & 0 & 0 & \frac{(4C^2-1)^2}{128C^3} \\
0 & 0 & \frac{16C^4-1}{32C^2} & 0 & \frac{(4C^2-1)^2}{128C^3} \\
0 & 0 & 0 & \frac{16C^4-1}{32C^2} & \frac{(4C^2-1)^2}{128C^3} \\
\frac{(4C^2-1)^2}{128C^3} & \frac{(4C^2-1)^2}{128C^3} & \frac{(4C^2-1)^2}{128C^3} & \frac{(4C^2-1)^2}{128C^3} & \frac{(16C^4-24C^2-3)(4C^2-1)^2}{256C^4(1+4C^2)^2}
\end{pmatrix}$$

The first 4 coordinates of the Jacobians are equal and have the same value for all trees:

$$\frac{16C(64C^6 - 16C^4 + 12C^2 + 1)}{(1 - 16C^4)(16C^4 + 24C^2 + 1)}.$$

(At $C = 1/4$ this equals $1744/615$, in agreement with the previous section.) Thus for each tree we will only list the last coordinate of the Jacobian $J_0$, which we denote as $J_0[5]$.

Let

$$\beta = (4C^2 - 1)^2 (16C^4 + 24C^2 + 1), \qquad \gamma = \beta \, \frac{(1 + 4C^2)^2}{16}.$$
Note that $\beta > 0$ and $\gamma > 0$ for $C \in (0, 1/2)$.

For $T_1$ we have

$$J_0[5] = \frac{1}{\beta} \cdot 128C^2 (16C^6 - 24C^4 + 17C^2 + 1),$$

$$\Delta_1 := -\frac{1}{2} J_0 H^{-1} J_0^T = \frac{1}{\gamma} \cdot (-512C^{12} - 2048C^{10} + 3520C^8 - 1856C^6 + 390C^4 + 88C^2 + 3),$$

and the last coordinate of $-H^{-1} J_0^T$ is

$$L := \frac{48C^6 - 40C^4 + 15C^2 + 2}{2C^2 (1 + 4C^2)^2}.$$

It is easily checked that $L$ is positive for $C \in (0, 1/2)$ and hence we have equality in Lemma 28.

For $T_2$ we have

$$J_0[5] = \frac{1}{\beta} \cdot 128C^4 (16C^4 - 40C^2 - 7),$$

$$\Delta_2 := -\frac{1}{2} J_0 H^{-1} J_0^T = \frac{1}{\gamma} \cdot (-512C^{12} - 5120C^{10} + 960C^8 + 832C^6 + 198C^4 + 28C^2 + 1).$$

For $T_3$ we have

$$J_0[5] = \frac{1}{\beta} \cdot 256C^4 (16C^4 - 8C^2 - 3),$$

$$\Delta_3 := -\frac{1}{2} J_0 H^{-1} J_0^T = \frac{1}{\gamma} \cdot (2048C^{12} - 2048C^{10} - 512C^8 + 72C^4 + 24C^2 + 1).$$

It remains to show that $\Delta_1 > \Delta_2$ and $\Delta_1 > \Delta_3$. We know that for $C = 1/4$ the inequalities hold. Thus we only need to check that $\Delta_1 - \Delta_2$ and $\Delta_1 - \Delta_3$ do not have roots for $C \in (0, 1/2)$. This is easily done using Sturm sequences, which is a standard approach for counting the number of roots of a polynomial in an interval.
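The root count can be reproduced with a textbook Sturm chain. The following is a minimal Python sketch under stated assumptions: it works with the numerator polynomials of $\Delta_1$, $\Delta_2$, $\Delta_3$ as printed above (the common positive factor $1/\gamma$ does not affect roots in $(0, 1/2)$), and all function names are illustrative. Roots in the half-open interval $(a, b]$ are counted by the drop in sign variations, and a root sitting exactly at $b$ is subtracted to obtain the open interval.

```python
from fractions import Fraction as F

def dense(coeffs, degree=12):
    """Coefficient list, highest degree first, from a {power: coefficient} dict."""
    return [F(coeffs.get(k, 0)) for k in range(degree, -1, -1)]

def strip(p):
    p = list(p)
    while p and p[0] == 0:
        p.pop(0)
    return p

def poly_eval(p, x):
    acc = F(0)
    for c in p:
        acc = acc * x + c
    return acc

def poly_deriv(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]

def poly_rem(a, b):
    """Remainder of a divided by b (b has a non-zero leading coefficient)."""
    a = list(a)
    while len(a) >= len(b):
        factor = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= factor * b[i]
        a.pop(0)  # the leading coefficient is now zero
    return strip(a)

def sturm_chain(p):
    chain = [p, poly_deriv(p)]
    while True:
        r = poly_rem(chain[-2], chain[-1])
        if not r:
            return chain
        chain.append([-c for c in r])

def variations(chain, x):
    signs = [v for v in (poly_eval(q, x) for q in chain) if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if (s > 0) != (t > 0))

def count_roots_open(p, lo, hi):
    """Number of distinct real roots of p in the open interval (lo, hi)."""
    chain = sturm_chain(p)
    in_half_open = variations(chain, lo) - variations(chain, hi)
    return in_half_open - (1 if poly_eval(p, hi) == 0 else 0)

# Numerators of Delta_1, Delta_2, Delta_3 as polynomials in C.
P1 = {12: -512, 10: -2048, 8: 3520, 6: -1856, 4: 390, 2: 88, 0: 3}
P2 = {12: -512, 10: -5120, 8: 960, 6: 832, 4: 198, 2: 28, 0: 1}
P3 = {12: 2048, 10: -2048, 8: -512, 4: 72, 2: 24, 0: 1}
d12 = strip(x - y for x, y in zip(dense(P1), dense(P2)))
d13 = strip(x - y for x, y in zip(dense(P1), dense(P3)))

for name, p in (("Delta_1 - Delta_2", d12), ("Delta_1 - Delta_3", d13)):
    print(name, "roots in (0, 1/2):", count_roots_open(p, F(0), F(1, 2)),
          "sign at C = 1/4:", "+" if poly_eval(p, F(1, 4)) > 0 else "-")
```

Note that the numerator of $\Delta_1 - \Delta_2$ vanishes at the endpoint $C = 1/2$, which is why the endpoint correction in `count_roots_open` matters here.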

8.5 Proof of Theorem 2 in JC, K2, and K3

Our technique requires little additional work to extend the result to the JC, K2, and K3 models. Let the JC-likelihood of tree $T$ on distribution $\mu$ be the maximal likelihood of $\mu_{T,\vec{w}}$ over all labelings $\vec{w}$ in the JC model. Similarly we define the K2-likelihood and K3-likelihood. Note that the K3-likelihood of a tree is greater than or equal to its K2-likelihood, which is greater than or equal to its JC-likelihood. In the following we will consider a mixture distribution generated from the JC model, and look at the likelihood (under a non-mixture) for the JC, K2 and K3 models.
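For the K3 model, the transition matrices are of the form

$$P_{K3}(\alpha, \beta, \gamma) = \begin{pmatrix}
1-\alpha-\beta-\gamma & \alpha & \beta & \gamma \\
\alpha & 1-\alpha-\beta-\gamma & \gamma & \beta \\
\beta & \gamma & 1-\alpha-\beta-\gamma & \alpha \\
\gamma & \beta & \alpha & 1-\alpha-\beta-\gamma
\end{pmatrix}$$

with $\alpha \ge \beta \ge \gamma > 0$, $\alpha + \beta < 1/2$, and $\gamma > (\alpha + \gamma)(\beta + \gamma)$. The K2 model is the case $\beta = \gamma$, the JC model is the case $\alpha = \beta = \gamma$.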

Theorem 30. Let $P_1 = (1/8 + x, 1/8 - x, 1/8 - x, 1/8 + x, x^2)$ and $P_2 = (1/8 - x, 1/8 + x, 1/8 + x, 1/8 - x, x^2)$. Let $\mu_x$ denote the following mixture distribution on $T_3$ generated from the JC model:

$$\mu_x = \left( \mu_{T_3,P_1} + \mu_{T_3,P_2} \right)/2.$$

There exists $x_0 > 0$ such that for all $x \in (0, x_0)$ the JC-likelihood of $T_1$ on $\mu_x$ is higher than the K3-likelihood of $T_2$ and $T_3$ on $\mu_x$.

Note, Part 2 of Theorem 2 for the JC, K2, and K3 models is immediately implied by Theorem 30.

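First we argue that in the JC model $T_1$ is the most likely tree. As in the case for the CFN model, because of symmetry, we have the same Hessian for all trees.

$$H = \begin{pmatrix}
-\frac{2373504}{112255} & -\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{587856}{112255} \\
-\frac{915872}{336765} & -\frac{2373504}{112255} & -\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{587856}{112255} \\
-\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{2373504}{112255} & -\frac{915872}{336765} & -\frac{587856}{112255} \\
-\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{915872}{336765} & -\frac{2373504}{112255} & -\frac{587856}{112255} \\
-\frac{587856}{112255} & -\frac{587856}{112255} & -\frac{587856}{112255} & -\frac{587856}{112255} & -\frac{1130124}{112255}
\end{pmatrix}$$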

Again the Jacobians differ only in the last coordinate.

For $T_3$:

$$J_0 = \left( \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, -\frac{7085428}{112255} \right). \qquad (40)$$

Then,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{174062924259638}{237159005655} \approx 733.9503.$$

For $T_2$:

$$J_0 = \left( \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, -\frac{8069818}{112255} \right). \qquad (41)$$

Then,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{410113105846051}{474318011310} \approx 864.6374.$$

For $T_1$:

$$J_0 = \left( \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{4199248}{112255}, \frac{22878022}{112255} \right), \qquad (42)$$

$$-H^{-1} J_0^T = \left( -\frac{10499073}{2816908}, -\frac{10499073}{2816908}, -\frac{10499073}{2816908}, -\frac{10499073}{2816908}, \frac{118305233}{4225362} \right). \qquad (43)$$

Note, again the last coordinate is positive as required for the application of Lemma 28. Finally,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{1221030227753251}{474318011310} \approx 2574.286. \qquad (44)$$

Now we bound the K3-likelihood of $T_2$ and $T_3$. The Hessian matrix is now $15 \times 15$. It is the same for all the 4-leaf trees and has a lot of symmetry. There are only 8 different entries in $H$. For distinct $i, j \in [4]$ we have

$$\begin{aligned}
\frac{\partial^2}{\partial p_i \partial p_i} f(\mu_0) &= -538996/112255, &
\frac{\partial^2}{\partial p_i \partial p_j} f(\mu_0) &= -605684/1010295, \\
\frac{\partial^2}{\partial p_i \partial p_5} f(\mu_0) &= -132304/112255, &
\frac{\partial^2}{\partial p_i \partial r_i} f(\mu_0) &= -126086/112255, \\
\frac{\partial^2}{\partial p_i \partial r_j} f(\mu_0) &= -51698/336765, &
\frac{\partial^2}{\partial p_i \partial r_5} f(\mu_0) &= -2448/8635, \\
\frac{\partial^2}{\partial p_5 \partial p_5} f(\mu_0) &= -268544/112255, &
\frac{\partial^2}{\partial p_5 \partial r_5} f(\mu_0) &= -54082/112255.
\end{aligned}$$

For $T_3$, its Jacobian $J_0$ is a vector of 15 coordinates. It turns out that $3J_0$ is the concatenation of 3 copies of the Jacobian for the JC model which is stated in (40). Finally, we obtain

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{174062924259638}{237159005655} \approx 733.9503. \qquad (45)$$

For $T_2$ we again obtain that for its Jacobian $J_0$, $3J_0$ is the concatenation of 3 copies of (41). Then,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{410113105846051}{474318011310} \approx 864.6374. \qquad (46)$$

Finally, for $T_1$, for its Jacobian $J_0$, $3J_0$ is the concatenation of 3 copies of (42) and $-H^{-1} J_0^T$ is the concatenation of 3 copies of (43). Then,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{1221030227753251}{474318011310} \approx 2574.286.$$

Note that the quantities $-\frac{1}{2} J_0 H^{-1} J_0^T$ in the K3 model are the same as the corresponding quantities in the JC model. It appears that even though the optimization is over the K3 parameters, the optimum assignment is a valid setting for the JC model.

Observe that $-\frac{1}{2} J_0 H^{-1} J_0^T$ for $T_1$ in the JC model (see (44)) is larger than for $T_2$ and $T_3$ in the K3 model (see (46) and (45)). Applying Lemma 28, this completes the proof of Theorem 30.



9 MCMC Results 

The following section has a distinct perspective from the earlier sections. Here we are generating $N$ samples from the distribution and looking at the complexity of reconstructing the phylogeny. The earlier sections analyzed properties of the generating distribution, as opposed to samples from the distribution. In addition, instead of finding the maximum likelihood tree, we are looking at sampling from the posterior distribution over trees. To do this, we consider a Markov chain whose stationary distribution is the posterior distribution, and analyze the chain's mixing time.

For a set of data $D = (D_1, \ldots, D_N)$ where $D_i \in \{0, 1\}^n$, its likelihood on tree $T$ with transition matrices $P$ is

$$\mu_{T,P}(D) = \Pr(D \mid T, P) = \prod_i \Pr(D_i \mid T, P) = \prod_i \mu_{T,P}(D_i) = \exp\left( \sum_i \ln \mu_{T,P}(D_i) \right).$$

Let $\Phi(T, P)$ denote a prior density on the space of trees where

$$\sum_T \int_P \Phi(T, P)\, dP = 1.$$
Our results extend to priors that are lower bounded by some $\epsilon$ as in Mossel and Vigoda [19]. In particular, for all $T, P$, we require $\Phi(T, P) \ge \epsilon$. We refer to these priors as $\epsilon$-regular priors.

Applying Bayes' law we get the posterior distribution:

$$\Pr(T, P \mid D) = \frac{\Pr(D \mid T, P)\, \Phi(T, P)}{\Pr(D)} = \frac{\Pr(D \mid T, P)\, \Phi(T, P)}{\sum_{T'} \int_{P'} \Pr(D \mid T', P')\, \Phi(T', P')\, dP'}.$$

Note that for uniform priors the posterior probability of a tree given $D$ is proportional to $\Pr(D \mid T)$.

Each tree $T$ then has a posterior weight

$$w(T) = \int_P \Pr(D \mid T, P)\, \Phi(T, P)\, dP.$$



We look at Markov chains on the space of trees where the stationary distribution of a tree is its posterior probability. We consider Markov chains using nearest-neighbor interchanges (NNI). The transitions modify the topology in the following manner, which is illustrated in Figure 3.

Let $S_t$ denote the tree at time $t$. The transition $S_t \to S_{t+1}$ of the Markov chain is defined as follows. Choose a random internal edge $e = (u, v)$ in $S_t$. Internal vertices have degree three, thus let $a, b$ denote the other neighbors of $u$ and $y, z$ denote the other neighbors of $v$. There are three possible assignments for these 4 subtrees to the edge $e$ (namely, we need to define a pairing between $a, b, y, z$). Choose one of these assignments at random, and denote the new tree as $S'$. We then set $S_{t+1} = S'$ with probability

$$\min\{1, w(S')/w(S_t)\}. \qquad (47)$$

With the remaining probability, we set $S_{t+1} = S_t$.

The acceptance probability in (47) is known as the Metropolis filter and implies that the unique stationary distribution $\pi$ of the Markov chain satisfies, for all trees $T$:

$$\pi(T) = \frac{w(T)}{\sum_{T'} w(T')}.$$
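The effect of the Metropolis filter can be illustrated on a toy state space. The following minimal Python sketch is not an implementation of the NNI chain itself: the four-state cycle, its weights, and the function name `metropolis_matrix` are hypothetical stand-ins for tree topologies, NNI neighborhoods, and posterior weights $w(T)$. It assumes, as for NNI moves, that every state has the same number of neighbors and that proposals are uniform and symmetric.

```python
from fractions import Fraction as F

def metropolis_matrix(neighbors, w):
    """Transition matrix of a Metropolis chain with uniform proposals over neighbors."""
    states = sorted(neighbors)
    P = {s: {t: F(0) for t in states} for s in states}
    for s in states:
        for t in neighbors[s]:
            # propose t uniformly, accept with probability min{1, w(t)/w(s)}
            P[s][t] += F(1, len(neighbors[s])) * min(F(1), w[t] / w[s])
        P[s][s] += 1 - sum(P[s].values())  # rejected proposals stay put
    return P

# Hypothetical toy example: four "trees" arranged in a cycle.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
w = {0: F(1), 1: F(2), 2: F(3), 3: F(4)}

P = metropolis_matrix(neighbors, w)
total = sum(w.values())
pi = {s: w[s] / total for s in w}  # claimed stationary distribution w(T) / sum of weights
pi_next = {t: sum(pi[s] * P[s][t] for s in pi) for t in pi}
print(pi_next == pi)
```

Because the proposal is symmetric, detailed balance $\pi(s) P(s,t) = \pi(t) P(t,s)$ holds entry by entry, which is what makes $\pi$ stationary.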

We refer readers to Felsenstein jH] and Mossel and Vigoda jTS] for a more detailed 
introduction to this Markov chain. We are interested in the mixing time $\tau_{\mathrm{mix}}$, defined as the number of steps until the chain is within variation distance $\le 1/4$ of the stationary distribution. The constant 1/4 is somewhat arbitrary, and can be reduced to any $\delta$ with $\tau_{\mathrm{mix}} \log(1/\delta)$ steps.

Figure 3: NNI transitions.

For the MCMC result we consider trees on 5 taxa. Thus trees in this section will have leaves numbered 1, 2, 3, 4, 5 and internal vertices numbered 6, 7, 8. Let $S_1$ denote the tree $(((12), 5), (34))$. Thus, $S_1$ has edges $e_1 = \{1, 6\}$, $e_2 = \{2, 6\}$, $e_3 = \{3, 8\}$, $e_4 = \{4, 8\}$, $e_5 = \{5, 7\}$, $e_6 = \{6, 7\}$, $e_7 = \{7, 8\}$. We will list the transition probabilities for the edges of $S_1$ in this order. For the CFN model, we consider the following vectors of transition probabilities. Let

$$P_1 = (1/4 + x, 1/4 - x, 1/4 - x, 1/4 + x, 1/4, c, c)$$

and

$$P_2 = (1/4 - x, 1/4 + x, 1/4 + x, 1/4 - x, 1/4, c, c),$$

where 




For the JC model let

$$P_1 = (1/8 + x, 1/8 - x, 1/8 - x, 1/8 + x, 1/8, c', c')$$

and

$$P_2 = (1/8 - x, 1/8 + x, 1/8 + x, 1/8 - x, 1/8, c', c'),$$

where

$$c' = 16x^2.$$

Let $\mu_1 = \mu_{S_1,P_1}$ and $\mu_2 = \mu_{S_1,P_2}$. We are interested in the mixture distribution:

$$\mu = \frac{1}{2}(\mu_1 + \mu_2).$$

Let $S_2$ denote the tree $(((14), 5), (23))$.

The key lemma for our Markov chain result states that under $\mu$, the likelihood has local maxima, with respect to NNI connectivity, on $S_1$ and $S_2$.

Lemma 31. For the CFN and JC models, there exists $x_0 > 0$ such that for all $x \in (0, x_0)$ and for all trees $S$ that are one NNI transition from $S_1$ or $S_2$, we have

$$\mathcal{L}_S(\mu) < \mathcal{L}_{S_1}(\mu), \qquad \mathcal{L}_S(\mu) < \mathcal{L}_{S_2}(\mu).$$

This then implies the following corollary.

Theorem 32. There exists a constant $C > 0$ such that for all $\epsilon > 0$ the following holds. Consider a data set with $N$ characters, i.e., $D = (D_1, \ldots, D_N)$, chosen independently from the distribution $\mu$. Consider the Markov chains on tree topologies defined by nearest-neighbor interchanges (NNI). Then with probability $1 - \exp(-CN)$ over the data generated, the mixing time of the Markov chains, with priors which are $\epsilon$-regular, satisfies

$$\tau_{\mathrm{mix}} \ge \epsilon \exp(CN).$$

The novel aspect of this section is Lemma 31. The proof of Theorem 32 using Lemma 31 is straightforward.

Proof of Lemma 31. The proof follows the same lines as the proof of Theorems 2 and 30. Thus our main task is to compute the Hessian and Jacobians, for which we utilize Maple.

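We begin with the CFN model.

The Hessian is the same for all 15 trees on 5 leaves:

$$H = \begin{pmatrix}
-\tfrac{3880}{1281} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{1241}{1281} & -\tfrac{190}{427} \\
-\tfrac{114}{427} & -\tfrac{3880}{1281} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{1241}{1281} & -\tfrac{190}{427} \\
-\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{3880}{1281} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{190}{427} & -\tfrac{1241}{1281} \\
-\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{3880}{1281} & -\tfrac{114}{427} & -\tfrac{190}{427} & -\tfrac{1241}{1281} \\
-\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{114}{427} & -\tfrac{3880}{1281} & -\tfrac{190}{427} & -\tfrac{190}{427} \\
-\tfrac{1241}{1281} & -\tfrac{1241}{1281} & -\tfrac{190}{427} & -\tfrac{190}{427} & -\tfrac{190}{427} & -\tfrac{6205}{3843} & -\tfrac{950}{1281} \\
-\tfrac{190}{427} & -\tfrac{190}{427} & -\tfrac{1241}{1281} & -\tfrac{1241}{1281} & -\tfrac{190}{427} & -\tfrac{950}{1281} & -\tfrac{6205}{3843}
\end{pmatrix}$$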



We begin with the two trees of interest:

[Figure: the trees $S_1$ and $S_2$.]

Their Jacobian is

$$J_0 = \left( \frac{17056}{1281}, \frac{17056}{1281}, \frac{17056}{1281}, \frac{17056}{1281}, \frac{2432}{427}, \frac{57952}{3843}, \frac{57952}{3843} \right)$$

and thus,

$$-H^{-1} J_0^T = (2, 2, 2, 2, 0, 4, 4).$$

Note the last two coordinates are positive, hence we get equality in the conclusion of Lemma 28. Finally,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{436480}{3843}.$$

We now consider those trees connected to $S_1$ and $S_2$ by one NNI transition. Since there are 2 internal edges, each tree has 4 NNI neighbors.

The neighbors of $S_1$ are

[Figure: the four NNI neighbors of $S_1$.]

The neighbors of $S_2$ are

[Figure: the four NNI neighbors of $S_2$.]

The Jacobian for all 8 of these trees is

$$J_1 = \left( \frac{17056}{1281}, \frac{17056}{1281}, \frac{17056}{1281}, \frac{2432}{427}, \frac{17056}{1281}, \frac{57952}{3843}, \frac{3728}{427} \right).$$

Finally,

$$-\frac{1}{2} J_1 H^{-1} J_1^T = \frac{2242633984}{20840589}.$$

Note the quantities $-\frac{1}{2} J H^{-1} J^T$ are larger for the two trees $S_1$ and $S_2$. This completes the proof for the CFN model.
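The step vector $(2, 2, 2, 2, 0, 4, 4)$ for $S_1$ and $S_2$ in the CFN model can be verified without inverting $H$, by checking $H y = -J_0^T$ directly. The following is a minimal Python sketch using exact fractions; the variable names are illustrative, and the value of `gain` is simply $\frac{1}{2} J_0 y$ computed from the stated quantities.

```python
from fractions import Fraction as F

# Entries of the 7 x 7 CFN Hessian as stated in the text.
a, b = F(-3880, 1281), F(-114, 427)        # pendant edges: diagonal, off-diagonal
near, far = F(-1241, 1281), F(-190, 427)   # pendant edge vs. internal edge
g, h = F(-6205, 3843), F(-950, 1281)       # internal edges: diagonal, off-diagonal

H = [[a if i == j else b for j in range(5)] for i in range(5)]
col6 = [near, near, far, far, far]         # internal edge e6 = {6,7}
col7 = [far, far, near, near, far]         # internal edge e7 = {7,8}
for i in range(5):
    H[i] += [col6[i], col7[i]]
H.append(col6 + [g, h])
H.append(col7 + [h, g])

J0 = [F(17056, 1281)] * 4 + [F(2432, 427), F(57952, 3843), F(57952, 3843)]
y = [2, 2, 2, 2, 0, 4, 4]                  # claimed value of -H^{-1} J0^T

Hy = [sum(H[i][j] * y[j] for j in range(7)) for i in range(7)]
gain = sum(j * s for j, s in zip(J0, y)) / 2   # equals -1/2 J0 H^{-1} J0^T
print(Hy == [-j for j in J0], gain, float(gain))
```

The coordinates of `y` for the two internal edges are positive, which is the condition under which Lemma 28 gives equality.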
We now consider the JC model. Again the Hessian is the same for all the trees:

$$H = \begin{pmatrix}
-\tfrac{512325018}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{328636353}{41082370} & -\tfrac{28145979}{8216474} \\
-\tfrac{36668964}{20541185} & -\tfrac{512325018}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{328636353}{41082370} & -\tfrac{28145979}{8216474} \\
-\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{512325018}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{28145979}{8216474} & -\tfrac{328636353}{41082370} \\
-\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{512325018}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{28145979}{8216474} & -\tfrac{328636353}{41082370} \\
-\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{36668964}{20541185} & -\tfrac{512325018}{20541185} & -\tfrac{28145979}{8216474} & -\tfrac{28145979}{8216474} \\
-\tfrac{328636353}{41082370} & -\tfrac{328636353}{41082370} & -\tfrac{28145979}{8216474} & -\tfrac{28145979}{8216474} & -\tfrac{28145979}{8216474} & -\tfrac{1273864167}{82164740} & -\tfrac{134747901}{20541185} \\
-\tfrac{28145979}{8216474} & -\tfrac{28145979}{8216474} & -\tfrac{328636353}{41082370} & -\tfrac{328636353}{41082370} & -\tfrac{28145979}{8216474} & -\tfrac{134747901}{20541185} & -\tfrac{1273864167}{82164740}
\end{pmatrix}$$



Beginning with tree $S_1$:

[Figure: the tree $S_1$.]

We have:

$$J_0 = \left( \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{1733695536}{20541185}, \frac{5655197244}{20541185}, \frac{5655197244}{20541185} \right).$$

The last two coordinates of $-H^{-1} J_0^T$ are

$$\frac{5114490004637540016}{593018923302763639}, \quad \frac{5114490004637540016}{593018923302763639}.$$

Since they are positive we get equality in the conclusion of Lemma 28. Finally,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{48101472911555370428804991552}{12181311412062878919972215} \approx 3948.793.$$

For the neighbors of $S_1$:

[Figure: the four NNI neighbors of $S_1$.]

We have:

$$J_1 = \left( \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{1733695536}{20541185}, \frac{4342624176}{20541185}, \frac{5655197244}{20541185}, \frac{2955839412}{20541185} \right).$$

Hence,

$$-\frac{1}{2} J_1 H^{-1} J_1^T = \frac{56725804836101083569837263821270061643565080096}{15568481282665727860752794372508821870798435}.$$
Considering $S_2$:

[Figure: the tree $S_2$.]

$$J_0 = \left( \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{1733695536}{20541185}, \frac{1074039432}{4108237}, \frac{1074039432}{4108237} \right).$$

The last two coordinates of $-H^{-1} J_0^T$ are

$$\frac{13458396942990580792}{1779056769908290917}, \quad \frac{13458396942990580792}{1779056769908290917},$$

which are positive. Finally,

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{45365294744197291555715368032}{12181311412062878919972215} \approx 3724.172.$$

Considering the neighbors of $S_2$:

[Figure: the four NNI neighbors of $S_2$.]

We have:

$$J_0 = \left( \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{4342624176}{20541185}, \frac{1733695536}{20541185}, \frac{4342624176}{20541185}, \frac{1074039432}{4108237}, \frac{2955839412}{20541185} \right).$$

Hence

$$-\frac{1}{2} J_0 H^{-1} J_0^T = \frac{7756149367472421142972629871553505755962112808}{2224068754666532551536113481786974552971205} \approx 3487.369.$$

□


Remark 33. In fact $S_1$ and $S_2$ have larger likelihood than any of the 13 other 5-leaf trees. However, analyzing the likelihood for the 5 trees not considered in the proof of Lemma 31 requires more technical work since $-\frac{1}{2} J_0 H^{-1} J_0^T$ is maximized at invalid weights for these 5 trees.

We now show how the main theorem of this section easily follows from the above lemma. The proof follows the same basic line of argument as in Mossel and Vigoda [18]; we point out the few minor differences.
Proof of Theorem 32. For a set of characters $D = (D_1, \ldots, D_N)$, define the maximum log-likelihood of tree $T$ as

$$\mathcal{L}_T(D) = \max_P \mathcal{L}_{T,P}(D),$$

where

$$\mathcal{L}_{T,P}(D) = \ln \Pr(D \mid T, P).$$

Consider $D = (D_1, \ldots, D_N)$ where each $D_i$ is independently sampled from the mixture distribution $\mu$. Let $S^*$ be $S_1$ or $S_2$, and let $S$ be a tree that is one NNI transition from $S^*$. Our main task is to show that $\mathcal{L}_{S^*}(D) > \mathcal{L}_S(D)$.

Let $P^*$ denote the assignment which attains the maximum for $\mathcal{L}_{S^*}(\mu)$, and let

$$\alpha = \min_{a \in \Omega^5} \Pr(a \mid S^*, P^*).$$

For $a \in \Omega^5$, let $D(a) = |\{i : D_i = a\}|$. By Chernoff's inequality (e.g., [TTJ Remark 2.5]), and a union bound over $a \in \Omega^5$, we have for all $\delta > 0$,

$$\Pr\left( \text{for all } a \in \Omega^5,\ |D(a) - \mu(a) N| \le \delta N \right) \ge 1 - 2 \cdot 4^5 \exp(-2\delta^2 N) = 1 - \exp(-\Omega(N)).$$

Assuming the above holds, we then have

$$\mathcal{L}_S(D) \le N(1 - \delta)\, \mathcal{L}_S(\mu).$$

And,

$$\mathcal{L}_{S^*}(D) \ge \mathcal{L}_{S^*,P^*}(D) \ge N(1 - \delta)\, \mathcal{L}_{S^*}(\mu) + 4^5 \delta N \log(\alpha).$$

Let $\gamma := \mathcal{L}_{S^*}(\mu) - \mathcal{L}_S(\mu)$. Note, Lemma 31 states that $\gamma > 0$.
Set 5 : = fSg' Note, 4" 5 > 5 > 0. Hence, 

$$\mathcal{L}_{S^*}(D) - \mathcal{L}_S(D) \ge N(1 - \delta)\gamma - N\gamma/10 \ge N\gamma/5 = \Omega(N).$$

This then implies that $w(S)/w(S^*) \le \exp(-\Omega(N))$ with probability $\ge 1 - \exp(-\Omega(N))$ by the same argument as the proof of Lemma 21 in Mossel and Vigoda [18]. Then, the theorem follows from a conductance argument as in Lemma 22 in [18].

□ 